Attorney Docket No.: LXGN-00104 



PATENT 



13. (Cancelled). 

14. (Cancelled). 

15. (Cancelled). 

16. (Amended) The method of claim 2, wherein said repeat sequences are postulated based 
upon amino acid sequences. 

17. (Cancelled). 

Claims 10-15 and 17 have been canceled. Claims 5-7 and 16 have been amended. Claims 
2, 3, 5-9, 16, 18-33, and 39 remain in the case. 

§ 112 Rejections 

Claim 17 is rejected by the Examiner under 35 U.S.C. 112, first paragraph, as failing to 
comply with the written description requirement. Claim 17 has been canceled, and Applicant 
requests this basis for rejection be removed from the case. 

Claims 5-16 are rejected by the Examiner under 35 U.S.C. 112, second paragraph, as 
being indefinite for failing to particularly point out and distinctly claim the subject matter which 
applicant regards as the invention. Specifically, the Examiner states that claims 5-16 are 
indefinite for recitation of the phrase "said sequences" because it is not clear which of the 
sequences in the claims from which claims 5-16 depend the phrase refers to. Claims 5-16 have 
been amended to address the Examiner's comments. Specifically, each claim that requires it 
(claims 5-7 and 16) has been amended to specifically reference either the "repeat sequences" or 
the "query sequence." These amendments fully address the Examiner's basis of rejection and 
Applicant requests the basis be removed from the case. 

§ 103 Rejections 

The Examiner rejects claims 2, 3, 5, 7, 8, 18-20, 27 and 30 under 35 U.S.C. 103(a) as 
being unpatentable over Jurka et al. (1996). 

The Examiner takes the position that it would have been obvious to a person of ordinary 
skill in the art at the time the invention was made to modify the method of Jurka et al. (1996) by 
addition of newly determined repeat sequences to a repeat sequence database so that the repeat 
sequence database would be a more comprehensive listing of repeat sequences. 

The Examiner also rejects claims 2, 6, 15, 16, 19-24, 26-29, and 31-33 under 35 U.S.C. 
103(a) as being unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, 
and 30, and further in view of Altschul et al. The Examiner argues that it would have been 
obvious to a person of ordinary skill in the art at the time the invention was made to modify the 
method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 by use of 
analysis of ribonucleotide sequences, sequences that encode amino acid sequences, repeat 
sequence databases accessible through the internet, use of public domain databases GenBank, 
dbEST, and SwissProt, use of search algorithms BLAST and FASTA, and use of scoring 
matrices PAM and BLOSUM because Altschul et al. shows use of all of those features in the 
context of searching sequence databases with query sequences whose repeat sequences have 
been masked. 



The Examiner also rejects claims 2 and 7-14 under 35 U.S.C. 103(a) as being 
unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, 
and further in view of Jurka (1998). The Examiner takes the position that it would have been 
obvious to a person of ordinary skill in the art at the time the invention was made to modify the 
method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 by use of repeat 
sequences from a variety of organisms so that corresponding query sequences from the 
organisms could be analyzed and masked. 

Claims 2, 22, and 25 are rejected by the Examiner under 35 U.S.C. 103(a) as being 
unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, 
and further in view of Sohocki et al. According to the Examiner's reasoning, it would have been 
obvious to a person of ordinary skill in the art at the time the invention was made to modify the 
method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above by use of 
TIGR Human Gene Index database because Sohocki et al. shows that the database is a useful 
source of human genes such as genes related to inherited retinal disorders. 




Applicant's Response to § 103 Rejections 

First, claim 39, which remains in the case and has not been rejected on any § 103 basis, 
would appear to be allowable. Applicant requests that the case be allowed at a minimum with 
claim 39 surviving. 

The remainder of the claims stand rejected, as noted above, under § 103 chiefly over Jurka 
(1996), alone or in combination with Altschul, Jurka (1998), and Sohocki. 

The cited chief reference of Jurka (1996) fails to teach at least one of the key inventive 
points of the present invention. This failure in teaching is not cured by or obvious over any of the 
secondary references cited. All of the art cited deals with taking an "unknown" sequence and 
querying it against a known sequence database to see where it fits in the broader sequence 
picture (a typical search against a database of "known" things and then categorize and report the 
results). In doing so, all of the cited art teaches away from the present invention as noted below. 

In contrast to Jurka (1996) and the cited secondary art, the present invention teaches the 
computer how to deal with hordes of random snippets of DNA sequence information, assemble 
them into contigs, and during the process "learn" how to identify and "mask" novel repetitive 
elements (which otherwise greatly confuse the assembly), and then reassemble the data via an 
iterative "learning" process that identifies new repeats and "remembers" to delete them from the 
subsequent assembly (by adding them to the "known" repeat/masking database). 

In fact, the type of searching and processing described in Jurka (1996) and the secondary 
cited art is clearly intended to be performed BEFORE the presently claimed invention/program is 
applied (i.e., each of the sequences is scanned against a "known" repeat database and all known 
sequences are masked prior to the sequence being used in the assembly). In addition, as directly 

Pergamon 

0097-8485(95)00023-2 

Computers Chem. Vol. 20, No. 1, pp. 119-121, 1996 
Copyright © 1996 Elsevier Science Ltd 
Printed in Great Britain. All rights reserved 
0097-8485/96 $15.00+0.00 



SOFTWARE NOTE 

CENSOR— A PROGRAM FOR IDENTIFICATION AND 
ELIMINATION OF REPETITIVE ELEMENTS FROM 
DNA SEQUENCES 

JERZY JURKA,* PAUL KLONOWSKI, VADIM DAGMAN and 

PAUL PELTON 

Linus Pauling Institute of Science and Medicine, 440 Page Mill Road, Palo Alto, CA 94306, U.S.A. 

(Received 19 October 1994; in revised form 10 February 1995) 

Abstract: CENSOR is a program designed to identify and eliminate fragments of DNA sequences 
homologous to any chosen reference sequences, in particular to repetitive elements. CENSOR is based on 
two principal algorithms of Smith & Waterman (1981) [J. Mol. Biol. 147, 195] and Wilbur & Lipman (1983) 
[Proc. Natl Acad. Sci. U.S.A. 80, 726]. It includes several pre-set sensitivity levels based on both biological 
and statistical criteria which help to distinguish between aligned pairs of homologous and non-homologous 
sequences. CENSOR has been implemented in C/C++ in the SUN/UNIX environment. 



INTRODUCTION 

Repetitive sequences are very abundant in eukaryotic 
genomes and, inevitably, almost every researcher 
working on newly sequenced eukaryotic DNA must 
deal with them. Usually, repetitive sequences are 
eliminated prior to GenBank/EMBL database 
searches. However, repeats are increasingly more 
often annotated and studied in the sequence context as 
integral components of the genetic material. 
Annotation and basic studies of repetitive DNA at the 
sequence level require specialized databases and 
computer software. A preliminary reference collection 
of human repeats and an on-line software for 
identification of repeats based on minimum length 
encoding method (PYTHIA) have been published 
before (Jurka et al., 1992). The reference collection 
continues to be updated and released electronically via 
National Center for Biotechnology Information 
(NCBI repository). The identification of repeats by 
PYTHIA is based on the alignment of a sequence 
under investigation against the reference collection of 
human repetitive elements without automatic 
elimination of the identified repeats in the analyzed 
sequence. Furthermore, alignment based on minimum 
length encoding method is relatively CPU-intensive 
which imposes significant limitations on its 
widespread usage. This prompted us to develop a new, 
more efficient and user-friendly program, called 
"CENSOR", based on recently described principles 
for identification and analysis of repetitive DNA 
(Jurka, 1994). Related software has recently been 
described by other authors (Claverie & States, 1993; 
Altschul et al., 1994; Claverie, 1994; Quentin & 
Finchant, 1994). This welcome development is likely 
to improve the quality of sequence data analysis in 
coming years. 

* Corresponding author. 

PROGRAM DESCRIPTION 

The basic steps implemented in CENSOR involve 
rapid comparison and alignment of reference 
sequences with a sequence under study, followed by 
replacement of homologous fragments by asterisks in 
the studied target sequence. The latter procedure is 
called 'censoring' (Jurka, 1994), and was first applied 
in studies of medium reiteration frequency (MER) 
repeats (Jurka, 1990). The CENSOR front-end 
interface permits running DASHER3 (Faulkner, 1987) 
for fast sequence comparisons (Wilbur & Lipman, 
1983). Following the fast search is the crucial step of 
LOCAL alignment (Smith & Waterman, 1981) and 
subsequent evaluation and elimination of homologous 
sequences. 
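The censoring step described above, replacing homologous fragments with asterisks in the target sequence, can be sketched in a few lines of Python. This is only an illustration, not CENSOR's own C/C++ code: the function name `censor`, the 1-based inclusive coordinates and the toy sequence are assumptions made for the example.

```python
def censor(sequence, fragments, mask_char="*"):
    """Return `sequence` with each fragment replaced by `mask_char`.

    `fragments` is an iterable of (start, end) pairs, 1-based and inclusive
    (an assumed convention for this sketch).
    """
    chars = list(sequence)
    for start, end in fragments:
        if not 1 <= start <= end <= len(chars):
            raise ValueError(f"fragment {start}-{end} is outside the sequence")
        # Overwrite the homologous stretch, keeping the sequence length unchanged.
        chars[start - 1:end] = mask_char * (end - start + 1)
    return "".join(chars)


if __name__ == "__main__":
    # Hypothetical target: positions 11-20 are taken to match a reference repeat.
    target = "TTTTCATACT" + "ACGTACGTAC" + "GGGG"
    print(censor(target, [(11, 20)]))
```

Because the masked sequence keeps its original length, coordinates reported for later alignments still refer to positions in the unmasked input.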

The censoring procedure has recently been 
implemented in XBLAST under the name of 'masking' 
(Claverie & States, 1993; Altschul et al., 1994; 
Claverie, 1994). Overall, CENSOR appears to be 
slower than XBLAST, but it is recommended over 
XBLAST whenever older and more diverse repeats are 
being searched. Furthermore, CENSOR uses the ratio 
of mismatches to transitions (see Jurka, 1994), in 
combination with alignment and similarity scores, to 
distinguish true homology from accidental similarity 
between sequences. 

To start the program, one simply types 'censor' at 
the prompt sign. The main menu allows the user to 
choose and set various options for running CENSOR. 
The user can choose to run DASHER3, or proceed 
directly with LOCAL alignment to assure maximum 
sensitivity. The menu also provides a choice of using 
any one of the pre-set sensitivity options for sequence 
comparison or changing any number of individual 
parameters as the user sees fit. These parameters 
are defined under the help menu option, along 
with additional instructions for running CENSOR. 
The remaining options in the main menu start the 
actual censoring process or restart an accidentally 
interrupted run. 



As indicated above, the user must supply two input 
files to run CENSOR. One of them is a reference file 
and the other the studied target sequence. For 
identification and elimination of repetitive DNA 
one should use repetitive elements as a reference 
file. A reference collection of human repeats has 
been described before (Jurka et al., 1992) and its 
expanded and updated version is available from the 
NCBI repository. Reference collections of repetitive 
sequences for other species are also available through 
the NCBI server and will be described in detail 
elsewhere. The sequences in the input files must be in 
the IG/Stanford format as previously described 
(Faulkner & Jurka, 1988). To distinguish between 
direct and complementary sequences, loci names 
should end with '@1' and '@2'. This labeling permits 
the option to automatically reverse and complement 
sequences before they are stored in 'plc.out'. 
Two formatting programs are distributed with 
CENSOR. 

Sequence alignment (local.out) 

Output file format: 

LOCUS1 N1 N2 LOCUS2 M1 M2 F1 F2 F3 L S # 

. . . aligned fragments . . . 

. . . statistics line from original file . . . 

where N1,N2,M1,M2 - aligned fragments boundaries 
F1 - (no. of Matches)/(no. of Matches + no. of Mismatches + no. of Gaps), 
F2 - (Number of Gaps)/(Number of Mismatches), 
     which is set to 0 if (Number of Mismatches) = 0, 
F3 - (Number of Mismatches)/(Number of Transitions), 
     which is set to 1 if (Number of Transitions) = 0 
     and to 100 if both numbers = 0, 
L - Length of the top sequence fragment, 
S - Local Score. 

Local parameters: 
Margin - 150 
Similarity threshold - 30.00 
Ratio threshold - 2.00 

ALU@1 75 138 XYZ 30 93 0.94 0.00 4.00 64 55.40 # 
AAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGC 
** **********************************I** ****************** **** 
AATTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCATCGCTACTAAAAATACAAAAAATAGC 
& Containing 59 matches, 0 gaps and 4 mismatches including 1 transitions 

Censored output (asap.out) 

;ID XYZ 
;DE DNA SEQUENCE 
XYZ 
TTTTCATACTCCCAGGCAGGGACGTTCCT***************************************** 
***********************ACTAGC1 

Eliminated sequences (plc.out) 

;LOCUS XYZ 
;DE DNA SEQUENCE 
;ALGNLOCUS XYZ ALU@1 
;FRAGMENT 30 -> 93 
XYZ 
AATTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCATCGCTACTAAAAATACAAAAAATAGC1 

Fig. 1. An example of output files from CENSOR. 
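The three ratios F1, F2 and F3 reported in the 'local.out' statistics line of Fig. 1 can be reproduced from the counts of matches, mismatches, gaps and transitions. The Python sketch below is an illustration only; the function name is hypothetical, and it assumes that the special cases in the figure apply when the relevant counts are exactly zero.

```python
def alignment_statistics(matches, mismatches, gaps, transitions):
    """Return (F1, F2, F3) as defined in the legend of Fig. 1.

    Assumption: the special cases in the legend are triggered by counts
    that are exactly zero.
    """
    aligned = matches + mismatches + gaps
    if aligned == 0:
        raise ValueError("empty alignment")

    # F1: fraction of aligned positions that are matches.
    f1 = matches / aligned

    # F2: gaps per mismatch, defined as 0 when there are no mismatches.
    f2 = 0.0 if mismatches == 0 else gaps / mismatches

    # F3: mismatches per transition, with the two special cases of the legend.
    if mismatches == 0 and transitions == 0:
        f3 = 100.0
    elif transitions == 0:
        f3 = 1.0
    else:
        f3 = mismatches / transitions
    return f1, f2, f3


if __name__ == "__main__":
    # Counts taken from the example alignment in Fig. 1.
    f1, f2, f3 = alignment_statistics(matches=59, mismatches=4, gaps=0, transitions=1)
    print(f"{f1:.2f} {f2:.2f} {f3:.2f}")
```

With the counts printed under the example alignment (59 matches, 0 gaps, 4 mismatches, 1 transition) the sketch gives 0.94, 0.00 and 4.00, the values shown in the figure's statistics line.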

Our analysis indicates that, depending on the specific 
case, DASHER3 may be more or less sensitive than other 
programs for fast search (Pearson & Lipman, 1988). 
Therefore, the user should use the direct LOCAL 
alignment option to verify the original output. This 
option is recommended only for files pre-censored 
using the fast search. Using the approximate list of 
matches from the fast search, the reference sequences are 
subsequently aligned with sequences under study 
using the LOCAL algorithm (Smith & Waterman, 
1981) and the homologous fragments are censored out. 

CENSOR generates three final output files outlined 
in Fig. 1. The alignment results are stored in 
'local.out'. The fragments homologous to the 
reference sequences are cut out and stored in 'plc.out'. 
The censored sequences are written to 'asap.out' with 
asterisks in place of repeats. One can choose to use 
other ASCII characters in place of asterisks. The file 
'asap.out' can be renamed and rerun against the 
reference collection under different conditions for 
possible identification and censoring of more distant 
repeats. However, one should remember that 
non-homologous sequence fragments will increasingly 
be censored out as one moves towards higher 
sensitivity levels. There are five pre-set sensitivity levels 
which contain built-in parameters for identification of 
similarities and for distinguishing true homologies 
from accidental similarities. Sensitivity of CENSOR 
can be adjusted by changing the 'window' size, as 
well as cutoff thresholds for similarity scores in 
DASHER3 and alignment scores in LOCAL. The 
lower the scores, the more distant the similarities 
reported by CENSOR. As indicated above, one can 
skip the fast search by DASHER3 and proceed directly 
with local alignment. To reduce false positives, the 
similarity is evaluated based on the biological fact that 
transitions, i.e. A <-> G and C <-> T mutations, 
are relatively more common than transversions. 
CENSOR uses the ratio of transversions to transitions 
(Jurka, 1994) or, equivalently, the ratio of mismatches 
to transitions in the aligned pair of sequences, where 
mismatches represent the sum of transitions and 
transversions between the aligned sequences. The 
expected ratio of mismatches to transitions, referred to 
as 'ratio', for a random match is 3:1. For the majority 
of searches a 'conservative option' (No. 3) is 
adequate. It includes low cut-off thresholds for 
DASHER3 (4.5), a low ratio of mismatches to 
transitions (2:1) and a relatively high LOCAL 
alignment score (30.0). If the LOCAL score exceeds 35.0, 
all alignments are reported by CENSOR 
irrespective of the ratio of mismatches to transitions. 
The next level of sensitivity (No. 4) sets the LOCAL score 
at 22.0 and the ratio at 2.8:1. Beginning with this level 
one has to evaluate the alignments using criteria other 
than those implemented in CENSOR. 
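The filtering logic of the conservative option can be illustrated with a short Python sketch: count mismatches and transitions in an aligned pair, then apply the score and ratio thresholds quoted above (30.0, 2:1 and the 35.0 override). The function names are hypothetical, and whether CENSOR treats the thresholds as inclusive or strict is not stated in the text, so the comparisons below are an assumption.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def count_substitutions(top, bottom):
    """Return (mismatches, transitions) for two aligned, equal-length strings.

    Columns containing a gap character '-' are skipped.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    mismatches = transitions = 0
    for a, b in zip(top.upper(), bottom.upper()):
        if a == "-" or b == "-" or a == b:
            continue
        mismatches += 1
        if frozenset((a, b)) in TRANSITIONS:
            transitions += 1
    return mismatches, transitions


def is_reported(score, ratio, min_score=30.0, max_ratio=2.0, override_score=35.0):
    """Decide whether an alignment passes the (assumed) conservative filter."""
    if score > override_score:
        # High-scoring alignments are reported irrespective of the ratio.
        return True
    return score >= min_score and ratio <= max_ratio


if __name__ == "__main__":
    print(count_substitutions("ACGT", "GTGT"))
    print(is_reported(score=55.40, ratio=4.00))
    print(is_reported(score=32.0, ratio=4.00))
```

The example alignment in Fig. 1 has a ratio of 4.00, above the 2.00 ratio threshold, yet it is still reported because its local score of 55.40 exceeds 35.0.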

Program availability 

CENSOR is currently available via the National 
Center for Biotechnology Information ftp server 
(ncbi.nlm.nih.gov). Login as 'anonymous' and use 
your email address when asked for the password. The 
software package is deposited in the 'repbase/censor' 
directory. In addition to CENSOR, two pieces of 
formatting software are included. The first one, 
'embl2ig', converts sequence files from EMBL format 
to IG/Stanford format and the second, 'compseq', 
generates complementary sequences with properly 
labeled loci names. The formatting programs are 
menu-driven and ask for input and output files. All 
software has been implemented in C/C++ under the 
SUN/UNIX environment. 

Acknowledgements — This work was supported by the U.S. 
Department of Energy, Human Genome Program Grant No. 
DE-FG03-91ER61152. 

REFERENCES 

Altschul S. F., Boguski M. S., Gish W. & Wootton J. C. (1994) Nature Genet. 6, 119. 
Claverie J.-M. (1994) Automated DNA Sequencing and Analysis (Edited by Adams M. D., Fields C. & Venter J. C.), p. 267. Academic Press, New York. 
Claverie J.-M. & States D. J. (1993) Computers Chem. 17, 191. 
Faulkner D. V. (1987) Unpublished data. 
Faulkner D. V. & Jurka J. (1988) TIBS 13, 321. 
Jurka J. (1990) Nucleic Acids Res. 18, 137. 
Jurka J. (1994) Automated DNA Sequencing and Analysis (Edited by Adams M. D., Fields C. & Venter J. C.), p. 294. Academic Press, New York. 
Jurka J., Walichiewicz J. & Milosavljevic A. (1992) J. Mol. Evol. 35, 286. 
Pearson W. & Lipman D. J. (1988) Proc. Natl Acad. Sci. U.S.A. 85, 2444. 
Quentin Y. & Finchant G. (1994) J. Theor. Biol. 166, 51. 
Smith T. F. & Waterman M. S. (1981) J. Mol. Biol. 147, 195. 
Wilbur W. J. & Lipman D. J. (1983) Proc. Natl Acad. Sci. U.S.A. 80, 726. 







Stephen F. Altschul, Mark S. Boguski, Warren Gish & John C. Wootton 



Sequence similarity search programs are versatile tools for the molecular biologist, 
frequently able to identify possible DNA coding regions and to provide clues to gene and 
protein structure and function. While much attention had been paid to the precise 
algorithms these programs employ and to their relative speeds, there is a constellation 
of associated issues that are equally important to realize the full potential of these 
methods. Here, we consider a number of these issues, including the choice of scoring 
systems, the statistical significance of alignments, the masking of uninformative or 
potentially confounding sequence regions, the nature and extent of sequence 
redundancy in the databases and network access to similarity search services. 



National Center for Biotechnology Information, National Library of 
Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA 

Correspondence should be addressed to M.S.B. 



The advent of rapid DNA sequencing technology in the 
mid-1970s led to an information explosion that continues 
unabated today. Molecular sequence data have become 
the common currency of biomedical research and often 
provide unexpected links among diverse biological 
systems. These connections accelerate research progress 
and may even open up entirely new fields of inquiry. One 
approach to discovering such connections, database 
"homology" searching, has been executed countless times, 
often with surprising results, and has become an essential 
method for the molecular biologist. While the particular 
algorithm used is of course important, the effectiveness of 
database searches is dependent as well on a large number 
of correlative factors, many of which tend to be overlooked 
or dealt with in an inefficient or ad hoc manner. These 
include the following: 

Scoring systems. Most database search algorithms rank 
alignments by a score, whose calculation is dependent 
upon a particular scoring system. Usually there is a default 
system, but it may not be ideal for a user's particular 
problem. For example, haemoglobin subunits used to be 
regarded as "typical" proteins and are often still used as 
benchmark query sequences for evaluating new database 
search techniques and scoring systems. However, today it 
is more common to encounter much larger and more 
complex sequences (see below), and methods developed 
and optimized for small, uniformly-conserved, single-domain 
proteins are inadequate. Scores that are best for 
detecting similarities between greatly diverged sequences 
differ from those best for detecting short but nearly 
identical segments. Optimal strategies for detecting 
similarities between DNA protein-coding regions differ 
from those for non-coding regions. Special scoring 
systems for detecting frame-shift errors in the databases 
have recently been described. A database search program 
should therefore make a variety of scoring systems available 
and users should be aware of which ones are best suited to 
their problems. 

Nature Genetics volume 6 february 1994 

Alignment statistics. Given a query sequence, most 
database search programs will produce an ordered list of 
imperfectly matching database similarities, but none of 
them need have any biological significance. An important 
question is how strong a similarity is necessary to be 
considered surprising. United by a common theory, a 
number of analytic and empirical results are now 
available for assessing database search results. However, 
one still sees occasional extravagant claims in the literature, 
usually springing either from misapplication of the normal 
distribution or from an absence of critical statistical 
analysis. 

Databases. The use of an up-to-date sequence database is 
clearly a vital element of any similarity search. Sequence 
relationships critical to important discoveries have on 
occasion been missed because old or incomplete databases 
were employed. However, the variety of databases available, 
and their overlapping coverage, has the potential to render 
similarity searching cumbersome and inefficient. This no 
longer need be the case. Timely access to complete and 
"nonredundant" sequence databases has become relatively 
simple and inexpensive. 

Database redundancy and sequence repetitiveness. 
Surprisingly strong biases exist in protein and nucleic acid 
sequences and sequence databases. Many of these reflect 
fundamental mosaic sequence properties that are of 
considerable biological interest in themselves, such as 
segments of low compositional complexity or short-period 
repeats. Databases also contain some very large families of 
related domains, motifs or repeated sequences, in some 
cases with hundreds of members. In other cases there has 
been a historical bias in the molecules that have been 
chosen for sequencing. In practice, unless special measures 
are taken, these biases very commonly confound database 
search methods and interfere with the discovery of 
interesting new sequence similarities. Problems include 
the occurrence of misleading, spuriously-high scores, 
ambiguities in the phase of sequence alignments and 
overwhelmingly large output lists in which interesting 
results may be inconspicuously buried. We shall describe 
some recently developed methods that largely solve these 
problems by automatically detecting and masking 
potentially confounding subsequences. 

Table 1 The BLAST family of programs 

Program (a): BLASTP 
Query sequence: protein 
Database sequences: protein 
Comments: 
• Default scoring matrix (b) is BLOSUM62; change with command line option "M=PAM250", for example. 
• Low-complexity masking with "-filter" option; choice of either the SEG or XNU algorithms 

Program (a): BLASTN 
Query sequence: nucleotide (both strands) 
Database sequences: nucleotide 
Comments: 
• Parameters optimized for speed, not sensitivity; not intended for finding distantly-related, coding sequences 
• Automatically checks complementary strand of query 

Program (a): BLASTX 
Query sequence: nucleotide (six-frame translation) 
Database sequences: protein 
Comments: 
• Very useful for preliminary data containing potential frameshift errors 
• Nine different genetic codes available (c); change with command line "C=1" (vertebrate mitochondrial) for example 
• Low-complexity filter option as for BLASTP 

Program (a): TBLASTN 
Query sequence: protein 
Database sequences: nucleotide (six-frame translations) 
Comments: 
• Essential for searching protein queries against dbEST 
• Often useful for finding undocumented open reading frames or frameshift errors in database sequences 
• Same genetic code options as for BLASTX 

(a) These programs are available through the BLAST Network and e-mail servers (see 
text) and the source codes are available by anonymous ftp on ncbi.nlm.nih.gov. 
(b) More than 65 different PAM, BLOSUM and other scoring matrices are 
available. PAM120 or BLOSUM62 are best for general purposes but a useful 
combination for detecting strong and short to long and weak similarities consists of 
PAM30, PAM120 and PAM250 (ref. 2). 
(c) Default genetic code (C=0) is "standard" or "universal" code. Other codes available 
include: 1, Vertebrate mitochondrial; 2, Yeast mitochondrial; 3, Mold mitochondrial 
and mycoplasma; 4, Invertebrate mitochondrial; 5, Ciliate macronuclear; 6, Protozoan 
mitochondrial; 7, Plant mitochondrial; and 8, Echinodermate mitochondrial. 
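Table 1 lists "six-frame translation" for the BLASTX query and the TBLASTN database. As an illustration of what that means, the Python sketch below translates a nucleotide sequence in all three reading frames of both strands. It is not code from the BLAST programs: the function names are hypothetical, only the standard ("universal") genetic code is included, and any codon containing a non-ACGT character is rendered as "X" by assumption.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate `dna` starting at offset `frame` (0, 1 or 2); '*' marks stops."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict keyed by (strand, frame number 1-3) with the six translations."""
    reverse = reverse_complement(dna)
    frames = {}
    for frame in range(3):
        frames[("+", frame + 1)] = translate(dna, frame)
        frames[("-", frame + 1)] = translate(reverse, frame)
    return frames


if __name__ == "__main__":
    for key, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(key, protein)
```

A search of the BLASTX kind would compare each of the six conceptual translations with the protein database; the alternative genetic codes listed in footnote (c) would require different codon tables, which this sketch does not include.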

Failure to deal properly with the factors described above 
can result in chance similarities being claimed significant, 
or biologically important relationships being overlooked. 
Here, we shall discuss these and several other issues in 
database searching. While we will frequently use the BLAST 
programs (Table 1) as examples, most of the questions 
considered have quite general relevance. 

Algorithms and programs 

The earliest sequence comparison studies focussed on the 
alignment of complete sequences. However, with the 
recognition that proteins frequently share only isolated 
regions of similarity, corresponding for instance to 
structural motifs or active sites, attention shifted to 
algorithms for local alignment. Essentially all database 
search methods have been based upon measures of local 
sequence similarity. 

In general, local alignments are assessed by means of a 
score, which is computed as the sum of scores for aligned 
pairs of residues and scores for gaps. How these scores 
are chosen, and what they signify, is discussed below. The 
time necessary to find alignments that optimize such 
scores is sufficiently great that, for most practical purposes, 
either parallel architecture machines or heuristic 
methods such as Fasta are required. The problem may 
be simplified by forbidding gaps. This leads to faster 
heuristic methods such as the BLAST algorithms (Table 1). 
While some sensitivity to weak similarities may be lost by 
eschewing gaps, easier generalization and rigorous 
statistical results become available. Alternatively, local 
alignments may be assessed in a more sophisticated manner 
than by the simple sum of substitution and gap scores. 
This may lead to more sensitive detection of weak 
similarities, but at the price of greatly increased 
computation time. 
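The "simple sum of substitution and gap scores" mentioned above can be made concrete with a small Python sketch. The scoring values (+5 for a match, -4 for a mismatch, -2 per gap column) and the function names are hypothetical choices for the example; real programs use substitution matrices such as PAM or BLOSUM and often more elaborate gap costs.

```python
def identity_score(a, b, match=5.0, mismatch=-4.0):
    """Toy substitution score: one value for identities, another for mismatches."""
    return match if a == b else mismatch


def score_alignment(top, bottom, pair_score=identity_score, gap_score=-2.0):
    """Sum pair scores over aligned residues and `gap_score` over gap columns.

    `top` and `bottom` are the two rows of an alignment, with '-' marking gaps.
    A flat per-column gap score is an assumption of this sketch.
    """
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    total = 0.0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            total += gap_score
        else:
            total += pair_score(a, b)
    return total


if __name__ == "__main__":
    print(score_alignment("ACGT-A", "ACCTGA"))
```

Forbidding gaps, as the BLAST algorithms described in the text do, corresponds to scoring only alignments in which no '-' column occurs.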

In general, the relevant considerations in choosing a 
particular algorithm are hardware requirements, speed 
and sensitivity to biological relationships. The tensions 
between these competing claims are resolved variously by 
programs such as Fasta, BLAST and Blaze. The relative 
merits of these and the other programs have been discussed 
at length elsewhere. The idea of optimizing a measure 
of local similarity is common to virtually all popular 
programs, and the results they produce therefore do not 
differ in any truly essential way. 

Local alignment statistics 

Not all biologically important sequence relationships will 
be detected by sequence similarity search programs and, 
even when found, they may be lost among irrelevant or 
chance similarities. While experiment is the ultimate 
arbiter of biological significance, mathematical analysis 
can indicate which similarities are unlikely to have arisen 
by chance and therefore merit special attention. Thus an 
important question concerning alignments produced by 
any database search is whether they can be considered 
statistically significant. 



Fig. 1 The probability density function of the extreme value 
distribution with characteristic value u=0 and decay 
constant λ=1. 
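The curve described in the caption of Fig. 1 can be evaluated directly. The Python sketch below uses the standard textbook form of the extreme value (Gumbel) distribution with characteristic value u and decay constant λ; the function names are hypothetical, and the formulas are the usual ones for this distribution rather than a transcription of Box 1.

```python
import math


def evd_pdf(x, u=0.0, lam=1.0):
    """Density: lam * exp(-lam*(x-u) - exp(-lam*(x-u)))."""
    z = lam * (x - u)
    return lam * math.exp(-z - math.exp(-z))


def evd_tail(x, u=0.0, lam=1.0):
    """Probability of a value >= x: 1 - exp(-exp(-lam*(x-u)))."""
    z = lam * (x - u)
    # -expm1(-t) computes 1 - exp(-t) accurately when t is tiny.
    return -math.expm1(-math.exp(-z))


if __name__ == "__main__":
    for x in (-2, 0, 2, 5, 10):
        print(f"x={x:>3}  pdf={evd_pdf(x):.4g}  tail={evd_tail(x):.4g}")
```

For u=0 and λ=1 the density peaks at x=0 with height 1/e, and for large x the tail probability approaches exp(-λ(x-u)), which is the exponential decay contrasted with the normal distribution in the text.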



One approach sonieti nes taken is to record an optimal 
local alignment score lor each database sequence and 
then to report these scores as stand:-.; d deviations from 
the mean. There are several serious and frequently 
unrecognized pitfalls to this procedure. First, the optimal 
scores for the comparison ofaquer)' sequence to different 



database sequences can not be assumed to be drawn from 
the same distribution. The longer a given database 
sequence, the greater the score expected by chance. Also, 
variation a residue composition among sequences can 
yield different score distributions. Second, unless a 
rigorous optimization algorithm is employed, the true 




Vyalue-lcystri^itidi^ 

.vui3bri:ifhe Dari:ieul^p>i^6i"inq c^.' ,^ 




When*P<p:Jl-;/it-M wed approximated as sfrnpily^Dp; thi&^ppro&h>f1j|l<e^^^ a^sumption}f[iat4i! ' ■ 

sequeneesfin-;th9 are a pr/on equailyriikely to share some relatfenCS^fw^^ q?ue^iy/A]r^iternatiye vie'^^^ 

;^'basec|on^|tie^ pnoteln^jsegmehtsjS^ ^" 

'■;cJktat|si|are1a likely to bejielafeS^^he^que^^ 




£as:^6tible:i^giMia^ 




Nature Genetics volume 6 february 1994 




121 



optimal pairwise scores will be systematically 
underestimated and the. shape of the true distribution 
will be ill-determined. Third, comparing a query sequence 
to a set of uniform length random sequences yields scores 
that obey not a normal but an extreme va/we distribution 
(Box 1 and Fig. 1). The tail of this distribution decays 
exponentially in x rather than x^, so assuming normality 



tends grossly to exaggerate an alignment's significance. 
Finally, a database search involves many essentially 
independent trials. If the database contains 50,000 
sequences, a score with probability 10^ of arising from a 
single comparison is only marginally significant in the 
context of the complete search. The last two points alone 
imply that an alignment may easily achieve a score over
ten standard deviations from the mean yet fail to be
statistically significant.
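
The difference between the two tail shapes, and the effect of many independent comparisons, can be made concrete with a short calculation. The sketch below is illustrative only: it standardizes an extreme value (Gumbel) distribution to zero mean and unit variance, which is an assumption made for the comparison and not the parameterization of Box 1, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def normal_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def gumbel_tail(z):
    """P(X >= z) for an extreme value (Gumbel) variable that has been
    standardized to mean 0 and standard deviation 1 (an assumption
    made here so the two tails can be compared directly)."""
    beta = math.sqrt(6.0) / math.pi   # scale that gives unit variance
    u = -EULER_GAMMA * beta           # location that gives zero mean
    return -math.expm1(-math.exp(-(z - u) / beta))


def search_p_value(p_single, n_comparisons):
    """Probability that at least one of n independent comparisons
    reaches a score whose single-comparison probability is p_single."""
    return -math.expm1(n_comparisons * math.log1p(-p_single))


if __name__ == "__main__":
    z = 10.0
    print("normal tail:        ", normal_tail(z))
    print("extreme value tail: ", gumbel_tail(z))
    print("over 50,000 trials: ", search_p_value(gumbel_tail(z), 50000))
```

Under these assumptions, a score ten standard deviations above the mean has a normal-theory probability below 10^-22, whereas the extreme value tail gives roughly 1.5 x 10^-6, and after 50,000 comparisons the probability of seeing such a score by chance is about 0.07.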

Box 1 discusses the extreme value distribution and how
it may be used to calculate the probability that a gap-free
local alignment with a given score would arise from the
comparison of two random sequences. It also describes
how to modify this probability to account for the "multiple
tests" of a database search. Such a search can itself generate
data which provide an alternative to the analytic method
(Box 1) for estimating alignment statistical significance.
For a given query, one records the best alignment score to
each database sequence. If score S is observed f(S) times,
then plotting log f(S) versus S tends to produce a straight
line; extrapolation of this line can yield estimates of
statistical significance.
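
The empirical procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular program: scores are taken to be integers, the line is fitted by ordinary least squares to every observed score, and the helper names are hypothetical. In practice one would fit only the part of the plot dominated by unrelated sequences.

```python
import math
from collections import Counter


def fit_log_frequency_line(scores):
    """Least-squares fit of log f(S) = intercept + slope * S, where f(S)
    is the number of database sequences whose best score equals S."""
    counts = Counter(scores)
    xs = list(counts.keys())
    ys = [math.log(counts[s]) for s in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_count_at_least(score, slope, intercept):
    """Extrapolate the fitted line: expected number of sequences scoring
    >= score, summing the geometric series over integer scores.
    Requires a negative slope."""
    if slope >= 0:
        raise ValueError("fitted slope must be negative to extrapolate")
    ratio = math.exp(slope)
    return math.exp(intercept + slope * score) / (1.0 - ratio)


if __name__ == "__main__":
    # Synthetic example: f(S) halves with each unit increase in S.
    scores = [s for s in range(11) for _ in range(2 ** (10 - s))]
    slope, intercept = fit_log_frequency_line(scores)
    print(slope, intercept, expected_count_at_least(11, slope, intercept))
```

The extrapolated count plays the role of an expected number of chance matches at or above a given score; a value well below one suggests the score is unlikely to have arisen by chance in that search.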

One advantage of this approach is that it is applicable
to cases for which no rigorous theory is available, such as
scores from gapped alignments. Thus heuristic programs
such as Fasta or parallel implementations of the
Smith-Waterman algorithm such as Blaze or Blitz can
estimate statistical significance using this method.
Furthermore, because the scores generated derive from
comparisons of real sequences, no "random protein"
model is needed. A disadvantage of the method is the
need to generate optimal alignment scores for a substantial
fraction of database sequences in order to calculate
statistical significance. Potential inaccuracy arises from
variation in database sequence size and composition,
which implies that each data point is really drawn from
a separate distribution. Also, if many sequences
related to the query are present (see discussion on database
redundancy below), it may be difficult to base the plotted
line upon only unrelated sequences. An alternative "curve
fitting" approach is to estimate the parameters of the
implicit extreme value distribution for the scoring system
at hand. In one form or another, curve fitting will
generally be necessary to calculate the statistical
significance of scores derived from gapped alignments or
other complex scoring systems.

The most important "failure" of the local alignment
statistics discussed here is on comparisons of regions with
restricted or unusual amino acid or nucleotide
composition. Such regions are quite common in proteins,
but are clearly not well described by the same random
model used for other sequence regions (see below). Because
an alignment of such "low complexity" regions has little
real meaning, it is best simply to note their existence, but
exclude them from alignments produced in database
searches (see Figs 2 and 3 for examples).

Scoring matrices and gap costs 

Many different amino acid substitution score matrices
have been proposed over the years for use with sequence
comparison and database search programs, and a
variety of rationales have been used for their construction.
However, it is possible to show that in the context of
seeking high-scoring segment pairs without gaps, any
such matrix has an implicit amino acid pair frequency
distribution that characterizes the alignments it is
optimized for finding. More precisely, let $p_i$ be the
frequency with which amino acid i occurs in protein
sequences and, within the class of alignments sought, let
$q_{ij}$ be the frequency with which amino acids i and j are
aligned. Then the scores that best distinguish these
alignments from chance are given by the formula

$$s_{ij} = \log \frac{q_{ij}}{p_i p_j}$$

The base of the logarithm is arbitrary, affecting only the
scale of the scores. Any set of scores useful for local
alignment can be written in this form, so a choice of
substitution matrix can be viewed as an implicit choice of
"target frequencies" $q_{ij}$ (refs 1,6).

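The relationship between target frequencies and scores can be illustrated with a toy example. The sketch below assumes a made-up two-letter alphabet and invented frequencies purely for illustration; the function names are hypothetical, and no real substitution matrix is reproduced here.

```python
import math


def log_odds_scores(target_freqs, background, base=2.0):
    """Score each aligned pair (i, j) as log(q_ij / (p_i * p_j))."""
    return {
        (a, b): math.log(q / (background[a] * background[b]), base)
        for (a, b), q in target_freqs.items()
    }


def implied_target_freqs(scores, background, base=2.0):
    """Invert the formula: recover q_ij = p_i * p_j * base ** s_ij."""
    return {
        (a, b): background[a] * background[b] * base ** s
        for (a, b), s in scores.items()
    }


if __name__ == "__main__":
    # Invented background and target frequencies for a two-letter alphabet.
    p = {"A": 0.5, "B": 0.5}
    q = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
    s = log_odds_scores(p and q, p)
    for pair in sorted(s):
        print(pair, round(s[pair], 3))
```

Pairs that are aligned more often than chance predicts receive positive scores and pairs aligned less often receive negative scores; inverting the formula recovers the target frequencies a given matrix implicitly assumes.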
The target frequencies characterizing alignments of
closely related sequences clearly differ from those for
alignments of sequences that are greatly diverged.
Therefore a single matrix cannot be optimal for
recognizing relationships at all evolutionary distances.
It has been argued that for most practical purposes, three
separate matrices should be adequate for locating all
alignments containing sufficient information to rise above
background noise. The question remains how best to
estimate the appropriate corresponding target frequencies.

Estimating the frequencies with which the various
amino acids tend to mutate into one another is a
necessarily empirical problem. The first approach to the
question was taken by Dayhoff and coworkers. Their
"PAM" model of molecular evolution allowed target
frequencies and the corresponding score matrices to be


Fig. 2 Significant sequence matches of the human MTG8 product: the effect of low-complexity masking. MTG8 (ref. 84) is the translated
product of a chromosome 8 gene involved in a t(8;21) translocation that results in an AML1-MTG8 fusion transcript in a case of acute
myeloid leukaemia (GenBank accession number D14820). a, Automated segmentation of low-complexity sequences in MTG8 at relatively
high stringency. To be defined as low-complexity in this run of the SEG algorithm (Box 2), a sequence region must contain at least one
12-residue window with complexity (K, Box 2) less than 0.315. SEG then finds the minimally probable (lowest P0, Box 2) low-complexity
subsequence, of any length, within the overlapping windows of this region. The sequence segments read from left to right and their order in
the polypeptide runs from top to bottom, as shown by the central column of residue numbers. b, The strong match, which emerges clearly
without masking (Poisson p-value 2.5 x 10^[exponent illegible]), between sections of MTG8 and Drosophila melanogaster transcription factor TFIID 110-kDa
subunit. c, MTG8 filtered as in (a) but with the low-complexity segments masked by "x" characters, for use as a query sequence in
database searches. d, The significant match between a region of MTG8 containing a cysteine cluster and rat apoptosis protein RP-8. RP-8
(ref. 87) is a gene expressed early in the process of programmed cell death (apoptosis) following glucocorticoid induction in rat thymocytes
(GenBank accession number M80601). This match had a Poisson p-value of 0.0036 for a BLASTP search of the NCBI non-redundant
database of 13th September 1993. *, Identical amino acids; I, Conserved Cys or His residues. Also shown is a sample of the class of
zinc-fingers that occur in the DNA binding domain of the steroid receptor family, indicating a suggestive similarity (which is not statistically
significant by pairwise alignment statistics and would require experimental confirmation) in the positions of most of the Cys or His residues.

Before low-complexity filtering, MTG8 generated an output list from the NCBI non-redundant database of greater than 400 Kbytes
containing 599 database sequences scoring above the BLASTP default threshold. The significant match to apoptosis protein was an
inconspicuous 62nd in this list and scored much lower than many spurious low-complexity matches. After masking of MTG8 as in (b), this
match was 6th in a list of 83 sequences. The latter list contained many matches to a "medium complexity" region of MTG8 which is
tentatively predicted to be alpha helical coiled coil (residues 416-476). Further filtering with SEG at lower stringency (K < 0.365 for a
14-residue window) effectively masked this region, and resulted in a BLASTP output list of only 9 sequences, in which the apoptosis protein
was ranked in score only below the MTG8 self-matches and the match to TFIID 110-kDa subunit.





[Box 2 (SEG segmentation of low-complexity sequences): text illegible in source.]



calculated for any desired amount of evolutionary change.
The details of the PAM model have been criticized, and
the vast increase in available sequence data has prompted
recalculation of the model's parameters. Scores for
DNA sequence comparison based on a PAM-like
mutational model have also been described. A different
approach to estimating appropriate target frequencies
relies not on fitting an evolutionary model, but rather on
the direct observation of relatively distant, but
nevertheless presumed largely correct, sequence
alignments. A variety of empirical tests have been
claimed to support the superiority of the resulting
"BLOSUM" matrices for detecting sequence
homology. Lacking an evolutionary model, however,
this approach is less adaptable to generating matrices
tailored to specific applications.

The theory linking substitution matrices with target
frequencies is rigorously established only for local
alignments lacking gaps. Therefore the development above
is generally valid only for the BLAST and related
algorithms. A more general theory for alignments
with gaps would probably have the same broad
outlines, and target frequency based substitution
matrices are perhaps nearly optimal
for this more general case. Gapped
alignments present the additional
problem of choosing appropriate gap
costs. The simplest algorithms
require these costs to be a linear
function of gap length, but
efficient algorithms for more general
gap costs are also available. Because
no theory exists, appropriate gap costs
have generally been chosen by trial
and error, although there have been
some recent efforts to give this
problem a sounder empirical
footing.

The user of database search
programs should recognize that the
default substitution scores and, where
applicable, gap costs, have generally
been chosen to be appropriate for the
most frequent sort of query. These
scores may not, however, be optimal
for a specific problem. In particular,
matrices such as PAM-120 or
BLOSUM-62 (the current BLASTP
default) are tailored for alignments
of moderately diverged sequences.
Very strong but short similarities, or
very long but weak ones may easily be
missed by these matrices. A fully
functional database search system
should therefore provide a range of
scoring systems to its users, so that
the algorithm can be adapted to the
problem at hand.




Databases and access 
The most important requirement for
database searching is a
comprehensive, up-to-date database.
Full releases of GenBank® now occur
every two months, and daily updates
are available for downloading or direct searching by
e-mail and network services. GenBank has undergone a
major expansion in data coverage and now includes, in
addition to nucleotide sequences, data from the major
protein sequence and protein structure databases, as well
as data from U.S. and European patents. Approximately
36% of the records in GenBank are produced by the
international collaborators, EMBL Data Library and the
DNA Database of Japan (DDBJ), with whom database
updates are exchanged daily. Copies of the databases are
available at many sites worldwide.

GenBank (release 80.0) contains 164 megabases of
sequence and is doubling in size every 21 months (D.
Benson, personal communication). This rate can only
increase as a result of genome projects and automated
sequencing technology. As mentioned above, special
purpose computers have a role in maintaining reasonable
search performance in the wake of this data deluge, but
considerable improvements in search efficiency can be
obtained by considering the nature of the data itself.

Many sequence databases have a large degree of
"redundancy" for historical reasons related to
technology and research, and also due to the
existence of clusters of closely related sequences from
multigene families. Also, equivalent gene products have
frequently been sequenced in a number of different species
or organisms. In release 36.0 of PIR International, for
example, there were 65[illegible] members of the globin
superfamily, 349 cytochromes c, 583 sequences with
immunoglobulin domains and 274 protein kinases.
Considering only perfectly matching sequences, among
the 52,257 protein sequences in this database, there are
over 3,900 duplicate entries and over 3,800 perfect
substrings of longer entries that together comprise about
10% of the total amino acid residues. Among nucleic acid
sequences there are thousands of Alu variants in GenBank.
And the problem of redundancy is only getting worse: as
a result of projects designed to sample expressed genes
rapidly, tens of thousands of sequence fragments are
being added to the databases; many of these sequences
represent small pieces of known genes. Due to the
error-prone nature of these sequence fragments, identifying
redundancy in these collections is a more difficult task.

As well as decreasing the speed of database searches,
redundancy can obscure novel matches in the output, by
yielding slews of similar or identical alignments. Practically,
there are two simple ways to avoid this problem: i)
construct a smaller "nonredundant" database; ii)
preprocess the query sequence for the presence of known
domains and mask these prior to searching. (The concept
of query masking is discussed in the next section.)

NCBI maintains two quasi-nonredundant sequence
collections (NRDB), one for proteins and one for nucleic
acids. For example, the protein NRDB is constructed
iteratively starting with SWISS-PROT, which is the
smallest and least redundant of the major protein
databases. All of the proteins in PIR International are
compared to those in SWISS-PROT, and identical
sequences are excluded from the former while maintaining
pointers to relevant annotation. Next, all of the protein
translations from GenBank coding sequences ("GenPept")
are compared to the merged SWISS-PROT plus PIR.
Likewise, protein sequences from the Brookhaven
structure database (PDB) and other sources are
incorporated into NRDB. (The OWL nonredundant
sequence database is constructed from the same sources.)
This simple procedure reduces the size of the combined
databases by 50%, yet ensures that all sequences are
represented. More sophisticated methods for creating
derived, composite views of protein and DNA sequence
data promise even further reductions.
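
The iterative merge described above can be sketched as follows. This is a minimal illustration under stated assumptions, not NCBI's actual procedure: each database is modelled as a plain mapping from entry identifier to sequence string, only exactly identical sequences are collapsed, and all names and sequences in the example are hypothetical.

```python
def merge_nonredundant(databases):
    """Merge databases in priority order, keeping one copy of each
    distinct sequence and a list of (database, entry) pointers to
    every entry that carried it.

    databases: list of (database_name, {entry_id: sequence}) pairs,
    with the least redundant database first.
    """
    merged = {}
    for db_name, entries in databases:
        for entry_id, sequence in entries.items():
            # A sequence already present is not stored again; only a
            # pointer back to its annotation is recorded.
            merged.setdefault(sequence, []).append((db_name, entry_id))
    return merged


if __name__ == "__main__":
    # Invented toy entries for illustration only.
    first = {"P1": "MKTAYIAK", "P2": "MLSRAVCG"}
    second = {"A1": "MKTAYIAK", "A2": "MGDVEKGK"}
    nrdb = merge_nonredundant([("first_db", first), ("second_db", second)])
    for sequence, pointers in nrdb.items():
        print(sequence, pointers)
```

Because the first database is processed first, its entry becomes the leading pointer for any shared sequence, which mirrors the idea of starting from the smallest and least redundant collection.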

Another key issue is access to the databases. Researchers
may perform database similarity searches remotely by
sending their queries, via electronic mail, to centralized
"server" computers, where large and frequently updated
databases are maintained, and where fast processors and
sophisticated software are available. E-mail services of
this sort have been available from various sources for
several years. For example, NCBI provides the BLAST
e-mail server (for more information, send a "help" message
to the Internet address blast@ncbi.nlm.nih.gov), and
EMBL provides Blitz (nethelp@embl-heidelberg.de).
Additional sites and services are given in ref. 64. In addition
to database search and retrieval services, such sites maintain
repositories of public domain software and specialized
datasets that may be accessed via "anonymous ftp" over
the Internet. The existence of high-performance networks
is also giving rise to a new generation of "client-server
applications" that make possible direct, real-time user
interactions with remote servers. NCBI's BLAST network
service and Entrez retrieval system are two examples. For
users of the many excellent commercial software packages
for sequence analysis, we would anticipate the development
of network client-server capabilities in the near future.

Masking of low-complexity sequences 
Interspersed local regions of very simple amino acid
composition are surprisingly abundant in protein
sequences. Some of these regions are homopolymers or
short-period repeats, but most are not periodic and appear
as mosaics of predominantly one or a few types of residue.
Their compositional bias is in marked contrast to the
structural domains and motifs of globular proteins familiar
from crystal and NMR structures. Based on a relatively
stringent definition of low-complexity, more than half
of the sequences in the database contain at least one such
region, and 14% of the amino acids occur in clusters of
highly biased local composition. Moreover, a large excess
of "medium-complexity" regions may be defined using a
less stringent definition of complexity: these are found in
many recently-deduced protein sequences that lack true
homologues and do not belong to the class of "ancient
conserved sequences". Very little is known about the
molecular structures, dynamics, interactions and evolution
of most low- and medium-complexity protein segments.



Fig. 3 The mouse protein Sos1 functions as a key intermediate in transmitting signals from receptor tyrosine kinases to ras via protein-protein interactions.
Sos1 (PIR accession S21391) is a member of a family of ras guanine nucleotide-releasing proteins (GNRP) that also includes S. cerevisiae CDC25 and SDC25, S.
pombe Ste6 and the Drosophila gene, Son of sevenless. Mouse Sos1 is a large, mosaic protein with several different domains, including a rasGNRP domain and
a low complexity region that binds to an "adapter" protein called Grb2. a, Results of a BLASTP search using an Sos1 query sequence without any masking
applied. In addition to several "self hits" in the output, we see significant matches to some S. cerevisiae proteins, but Ste6 does not appear in the top 20 matches
despite its presence in the database (PIR International, release 37). Moreover, the true positive matches are interspersed with many false positives, consisting of a
number of functionally unrelated proline-rich proteins. These artifactual matches are highly significant in the statistical sense, but a glance at some of the local
alignments shows that one is not justified in inferring similar function despite the high scores and low p-values. b, An identical search, except that in this case the
Sos1 query has been pre-processed using SEG masking with default parameters. Note that the top of the "hit list" is now populated only by bona fide members of
the rasGNRP family and that all artifactual matches against proline-rich proteins have disappeared. Furthermore, a match to S. pombe Ste6 is now obvious; a local
alignment between this protein and Sos1 is shown. Interestingly, Sos1 shows significant local similarities to histone H2A and β-spectrin (see below). c, Results of
another search with masking of both low complexity regions (b) and the rasGNRP domain. The top four matches now consist only of those proteins that share
more extensive, or global, similarity with the query beyond the rasGNRP domain. In this example, the additional information gained by this extra masking step is
not striking. But one can imagine the dramatic effect this would have in shrinking the "hit list" if the query possessed a kinase domain, of which there are hundreds
of examples in the database. (See ref. 74 for an example involving immunoglobulin domains.) d, The query sequence, mouse Sos1, annotated with the various
domains identifiable by BLASTP searching. The rasGNRP domain is according to Boguski & McCormick. The proline-rich carboxy terminal region is known to
interact with Src homology (SH3) domains in Grb2. With regard to the local similarities between Sos1 and histone H2A and β-spectrin, it has recently been shown
that Sos1, β-spectrin and a number of other proteins possess "pleckstrin homology" or PH domains. The local alignment produced by BLASTP (c) corresponds
to these PH domains. The similarity between Sos1 and histone H2A has not previously been reported and is difficult to interpret biologically. Nonetheless, the
similarity is as significant as that of the PH domain and may have structural, as opposed to functional, implications.


Nature Genetics volume 6 February 1994 



Low-complexity segments confound database search
algorithms in two ways. First, most of these segments do
not generally give meaningful alignments position by
position in ways that reflect actual structure and mutational
history: they evidently evolve relatively rapidly by processes
such as replication slippage and repeat expansion. (At
the DNA sequence level, trinucleotide and dinucleotide
repeat polymorphisms provide a familiar example.)
Permutations, shuffles or reversals of low-complexity
amino acid sequences generally give alignment scores
similar to the original sequence. Second, the residue
compositions of low-complexity segments are very
different from that of the database as a whole. This is
evident if all low-complexity segments in the database are
grouped into a single class: a strong excess of alanine,
glycine, proline, serine, glutamate and glutamine results.
However, this lumped class is itself heterogeneous,
containing for example glutamine-rich and proline-rich
subclasses. These statistical biases contrast with those that
characterize the bulk of most query and database
sequences, and on which score-based alignment statistics
are founded. Thus the high scores of alignments of
low-complexity segments are due primarily to their
compositional biases and do not necessarily reflect
significant positional similarity.
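
The effect of reversal on alignment scores can be illustrated with a small sketch. It is not the scoring system used by BLAST: the function name, the +1/-1 identity scoring and the example peptides are all assumptions chosen for illustration. It finds the best ungapped local segment over every diagonal and compares a sequence with its own reversal.

```python
def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Best-scoring ungapped local segment between a and b.

    Scans every diagonal and keeps a running score that is reset to
    zero whenever it would go negative (a simple maximal-segment scan).
    Identity scoring is an illustrative stand-in for a real matrix.
    """
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        running = 0
        i = max(offset, 0)    # position in a
        j = max(-offset, 0)   # position in b
        while i < len(a) and j < len(b):
            running = max(0, running + (match if a[i] == b[j] else mismatch))
            best = max(best, running)
            i += 1
            j += 1
    return best


# Hypothetical peptides: one homopolymeric, one of mixed composition.
low_complexity = "PPPPPPPP"
mixed = "MKTAYIAKQR"

for name, seq in (("low-complexity", low_complexity), ("mixed", mixed)):
    forward = best_ungapped_score(seq, seq)
    reverse = best_ungapped_score(seq[::-1], seq)
    print(name, "self:", forward, "reversed:", reverse)
```

For the homopolymer the reversed sequence scores exactly as well as the original, whereas the mixed-composition peptide loses its full-length match; this is the sense in which high scores between low-complexity segments reflect composition rather than position.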

Several classes of low-complexity residue clusters have
been analysed for statistical significance by Karlin and
coworkers. Their methods, which use the contrasting
residue frequencies of specific clusters and those of
complete proteins or databases, are embodied in the SAPS
software. SEG, the algorithm employed by the BLAST
programs for filtering low-complexity segments from
query sequences prior to database searching (Figs 2 and
3), employs instead optimal segmentation methods applied
to a more general definition of compositional complexity
(see Box 2).
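
A much-simplified sketch of low-complexity masking follows. It is not the SEG algorithm, which uses optimal segmentation; here a fixed window is slid along the query and any window whose Shannon entropy falls at or below a threshold is replaced by "X". The function names, the window length of 12 and the threshold of 2.2 bits are assumptions for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every residue that lies in at least one low-entropy window.

    Simplified stand-in for a real filter: no segmentation step, and
    the window and threshold values are illustrative assumptions.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# Hypothetical query: a proline run flanked by mixed-composition residues.
query = "ACDEFGHIKLMN" + "P" * 12 + "ACDEFGHIKLMN"
print(mask_low_complexity(query))
```

Because windows overlap the boundary of the biased region, a few flanking residues are masked as well; a real filter refines these boundaries, which is one reason the sketch should not be read as equivalent to SEG.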

Masking of highly abundant sequences

Database searching can be performed efficiently in phases,
with a query first compared to a small database containing
domains representative of large sequence families.
Subsequences of a query that match one or more of these
domains can then be masked prior to full-scale searching,
thereby eliminating most of the redundant output.
Annotated collections of prototypic human repetitive
sequences such as Alu and protein kinase catalytic
domains exist and can be used to pre-filter a query (Fig.
3c). (Both of these data sets are available from the NCBI
Data Repository on CD-ROM and by anonymous ftp. See
/repbase/alu, /repbase/humrep and /pkinases/pkcdd.fa at
ncbi.nlm.nih.gov.) For proteins, a more comprehensive
solution to the problem is approached by building a small,
representative set of protein superfamilies or motifs and
using this as a screening database with automatic masking
1. Altschul, S.F. Amino acid substitution matrices from an information theoretic
perspective. J. molec. Biol. 219, 555-565 (1991).

2. Altschul, S.F. A protein alignment scoring system sensitive at all evolutionary
distances. J. molec. Evol. 36, 290-300 (1993).

3. States, D.J., Gish, W. & Altschul, S.F. Improved sensitivity of nucleic acid
database searches using application-specific scoring matrices. Methods 3,
66-70 (1991).

4. Gish, W. & States, D.J. Identification of protein coding regions by database
similarity search. Nature Genet. 3, 265-272 (1993).

5. Claverie, J.-M. Detecting frameshifts by amino acid sequence comparison.
J. molec. Biol. 234, 1140-1157 (1993).

6. Karlin, S. & Altschul, S.F. Methods for assessing the statistical significance
of molecular sequence features by using general scoring schemes. Proc.
natn. Acad. Sci. U.S.A. 87, 2264-2268 (1990).



of matching query subsequences (unpublished results).
This technology is still under development but recent
studies indicate that a representative set of only 1,000-
3,000 sequences may suffice; such a database can be
searched in seconds. The first large-scale implementation
of this strategy has been performed for a specialized
database of "expressed sequence tags" or ESTs where
such pre-filtering is also employed to detect contamination
by vector sequences.
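
The phased strategy described above can be sketched as follows. Exact k-mer matching stands in here for the real similarity search against the screening database; the function name, the word length of 8, the mask character and the example sequences are all hypothetical.

```python
def mask_known_domains(query, screening_db, k=8, mask_char="N"):
    """Mask query positions covered by any k-mer found in the screening set.

    screening_db maps a name to a representative sequence. Exact word
    matching is an illustrative simplification of a similarity search.
    """
    known = set()
    for seq in screening_db.values():
        for i in range(len(seq) - k + 1):
            known.add(seq[i:i + k])

    masked = list(query)
    for i in range(len(query) - k + 1):
        if query[i:i + k] in known:
            masked[i:i + k] = mask_char * k
    return "".join(masked)


# Hypothetical screening set with one repeat-like entry.
screening_db = {"repeat_like": "GGCCGGGCGCGGTGGCTCA"}
query = "ATATAT" + "GGCCGGGCGCGGTGG" + "TTTTCCCC"

# Phase 1: screen and mask; phase 2 would search the masked query
# against the full database.
print(mask_known_domains(query, screening_db))
```

Only the masked query is passed on to the full-scale search, so matches to the abundant family no longer dominate the output.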

Conclusions 

The stated goals of the U.S. Genome Project include the
production of 50 megabases of DNA sequence data per
year by 1998 and the identification and correlation of
genes in humans and model organisms. Database
similarity searching will be one of the major informatics
tools used in this endeavor. Not only efficient algorithms,
but also a choice of appropriate scoring systems, well-
defined measures of statistical significance and a better
understanding of the sequences themselves, are critical
for the automated analysis schemes that this amount of
data will inevitably require.

Special purpose and faster general purpose computers
will have roles in sifting through this increasing volume of
sequence data. But large improvements in the efficiency
of searching can be obtained by considering the nature of
the data and implementing new strategies that capitalize
on this knowledge. One of these strategies is to preprocess
a query sequence to identify known domains and motifs,
dispersed repeats, low complexity segments and other
regions of compositional bias such as potential membrane-
spanning and α-helical coiled-coil regions. We have
described several preprocessing techniques that are suitable
for automation and have demonstrated their practical
utility with examples. Foreknowledge of query features
enables one to perform faster and more effective searches
and to better evaluate search results.

Another, complementary strategy is to reduce the
redundancy in the target database(s) to be searched.
We have outlined one simple but useful approach to
the reductive merging of diverse, but overlapping,
source databases. But newer, cleaner and richer views
of the sequence data, optimized for gene discovery, are
on the horizon.

Note added in proof: NCBI has recently established a
GenBank® World Wide Web server (the URL is
http://www.ncbi.nlm.nih.gov) that provides network access
to many of the software tools and data sources described
in this review.



Acknowledgements

GenBank is a registered trademark of the U.S. Department of Health
and Human Services.

7. Karlin, S., Dembo, A. & Kawabata, T. Statistical composition of high-scoring
segments from molecular sequences. Ann. Stat. 18, 571-581 (1990).

8. Dembo, A. & Karlin, S. Strong limit theorems of empirical functionals for large
exceedances of partial sums of i.i.d. variables. Ann. Prob. 19, 1737-1755
(1991).

9. Karlin, S. & Altschul, S.F. Applications and statistics for multiple high-scoring
segments in molecular sequences. Proc. natn. Acad. Sci. U.S.A. 90, 5873-
5877 (1993).

10. Smith, T.F., Waterman, M.S. & Burks, C. The statistical distribution of nucleic
acid similarities. Nucl. Acids Res. 13, 645-656 (1985).

11. Altschul, S.F. & Erickson, B.W. A nonlinear measure of subalignment
similarity and its significance levels. Bull. math. Biol. 48, 617-632 (1986).

12. Collins, J.F., Coulson, A.F.W. & Lyall, A. The significance of protein sequence
similarities. CABIOS 4, 67-71 (1988).




13. Mott, R. Maximum-likelihood estimation of the statistical distribution of
Smith-Waterman local sequence similarity scores. Bull. math. Biol. 54, 59-
75 (1992).

14. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local
alignment search tool. J. molec. Biol. 215, 403-410 (1990).

15. Needleman, S.B. & Wunsch, C.D. A general method applicable to the search
for similarities in the amino acid sequences of two proteins. J. molec. Biol.
48, 443-453 (1970).

16. Sellers, P.H. On the theory and computation of evolutionary distances. SIAM
J. appl. Math. 26, 787-793 (1974).

17. Sankoff, D. & Kruskal, J.B. Time Warps, String Edits and Macromolecules:
The Theory and Practice of Sequence Comparison (Addison-Wesley, Reading,
MA, 1983).

18. Smith, T.F. & Waterman, M.S. Identification of common molecular
subsequences. J. molec. Biol. 147, 195-197 (1981).

19. Goad, W.B. & Kanehisa, M.I. Pattern recognition in nucleic acid sequences.
I. A general method for finding local homologies and symmetries. Nucl. Acids
Res. 10, 247-263 (1982).

20. Sellers, P.H. Pattern recognition in genetic sequences by mismatch density.
Bull. math. Biol. 46, 501-514 (1984).

21. Waterman, M.S. & Eggert, M. A new algorithm for best subsequence
alignments with applications to tRNA-rRNA comparisons. J. molec. Biol.
197, 723-728 (1987).

22. Coulson, A.F.W., Collins, J.F. & Lyall, A. Protein and nucleic acid database
searching: a suitable case for parallel processing. Comp. J. 30, 420-424
(1987).

23. Chow, E.T., Hunkapiller, T., Peterson, J.C., Zimmerman, B.A. & Waterman,
M.S. in Proc. 1991 Int. Conf. on Supercomputing, 216-223 (ACM Press, New
York, 1991).

24. Jones, R. Sequence pattern matching on a massively parallel
computer. CABIOS 8, 377-383 (1992).

25. Brutlag, D.L. et al. BLAZE: an implementation of the Smith-Waterman
sequence comparison algorithm on a massively parallel computer. Comput.
Chem. 17, 203-207 (1993).

26. Sturrock, S.S. & Collins, J.F. MPsrch version 1.3. (Biocomputing Research
Unit, University of Edinburgh, 1993).

27. Lipman, D.J. & Pearson, W.R. Rapid and sensitive protein similarity searches.
Science 227, 1435-1441 (1985).

28. Pearson, W.R. & Lipman, D.J. Improved tools for biological sequence
comparison. Proc. natn. Acad. Sci. U.S.A. 85, 2444-2448 (1988).

29. White, C.T. et al. in Proc. 1991 IEEE Int. Conf. Comp. Design: VLSI in
Computers and Processors, 504-509 (IEEE Comp. Soc. Press, Los Alamitos,
CA, 1991).

30. Pearson, W.R. Searching protein sequence libraries: comparison of the
sensitivity and selectivity of the Smith-Waterman and FASTA algorithms.
Genomics 11, 635-650 (1991).

31. Altschul, S.F. & Lipman, D.J. Protein database searches for multiple
alignments. Proc. natn. Acad. Sci. U.S.A. 87, 5509-5513 (1990).

32. Argos, P. A sensitive procedure to compare amino acid sequences. J. molec.
Biol. 193, 385-396 (1987).

33. Vogt, G. & Argos, P. Searching for distantly related protein sequences in large
databases by parallel processing on a transputer machine. CABIOS 8, 49-
55 (1992).

34. McLachlan, A.D. Tests for comparing related amino-acid sequences.
Cytochrome c and cytochrome c551. J. molec. Biol. 61, 409-424 (1971).

35. Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. in Atlas of Protein Sequence
and Structure vol. 5, suppl. 3 (ed. M.O. Dayhoff) 345-352 (Natn. Biomed. Res.
Found., Washington, 1978).

36. Schwartz, R.M. & Dayhoff, M.O. in Atlas of Protein Sequence and Structure
vol. 5, suppl. 3 (ed. M.O. Dayhoff) 353-358 (Natn. Biomed. Res. Found.,
Washington, 1978).

37. Feng, D.F., Johnson, M.S. & Doolittle, R.F. Aligning amino acid sequences:
comparison of commonly used methods. J. molec. Evol. 21, 112-125 (1985).

38. Rao, J.K.M. New scoring matrix for amino acid residue exchanges based on
residue characteristic physical parameters. Int. J. peptide protein Res. 29,
276-281 (1987).

39. Risler, J.L., Delorme, M.O., Delacroix, H. & Henaut, A. Amino acid substitutions
in structurally related proteins. A pattern recognition approach. Determination
of a new and efficient scoring matrix. J. molec. Biol. 204, 1019-1029 (1988).

40. Gonnet, G.H., Cohen, M.A. & Benner, S.A. Exhaustive matching of the entire
protein sequence database. Science 256, 1443-1445 (1992).

41. Henikoff, S. & Henikoff, J.G. Amino acid substitution matrices from protein
blocks. Proc. natn. Acad. Sci. U.S.A. 89, 10915-10919 (1992).

42. Jones, D.T., Taylor, W.R. & Thornton, J.M. The rapid generation of mutation
data matrices from protein sequences. CABIOS 8, 275-282 (1992).

43. Overington, J., Donnelly, D., Johnson, M.S., Sali, A. & Blundell, T.L.
Environment-specific amino acid substitution tables: tertiary templates and
prediction of protein folds. Prot. Sci. 1, 216-226 (1992).

44. Wilbur, W.J. On the PAM matrix model of protein evolution. Molec. Biol. Evol.
2, 434-447 (1985).

45. Henikoff, S. & Henikoff, J.G. Performance evaluation of amino acid substitution
matrices. Proteins 17, 49-61 (1993).

46. Waterman, M.S., Gordon, L. & Arratia, R. Phase transitions in sequence
matches and nucleic acid structure. Proc. natn. Acad. Sci. U.S.A. 84, 1239-
1243 (1987).

47. Fitch, W.M. & Smith, T.F. Optimal sequence alignments. Proc. natn. Acad.
Sci. U.S.A. 80, 1382-1386 (1983).

48. Gotoh, O. An improved algorithm for matching biological sequences. J.
molec. Biol. 162, 705-708 (1982).

49. Altschul, S.F. & Erickson, B.W. Optimal sequence alignment using affine gap
costs. Bull. math. Biol. 48, 603-616 (1986).

50. Myers, E.W. & Miller, W. Optimal alignments in linear space. CABIOS 4, 11-
17 (1988).

51. Miller, W. & Myers, E.W. Sequence comparison with concave weighting
functions. Bull. math. Biol. 50, 97-120 (1988).

52. Pascarella, S. & Argos, P. Analysis of insertions/deletions in protein






structures. J. molec. Biol. 224, 461-471 (1992).

53. Benner, S.A., Cohen, M.A. & Gonnet, G.H. Empirical and structural models
for insertions and deletions in the divergent evolution of proteins. J. molec.
Biol. 229, 1065-1082 (1993).

54. Benson, D., Lipman, D.J. & Ostell, J. GenBank. Nucl. Acids Res. 21, 2963-
2965 (1993).

55. Rice, C.M., Fuchs, R., Higgins, D.G., Stoehr, P.J. & Cameron, G.N. The
EMBL data library. Nucl. Acids Res. 21, 2967-2971 (1993).

56. Barker, W.C., George, D.G., Mewes, H.-W., Pfeiffer, F. & Tsugita, A. The PIR-
International databases. Nucl. Acids Res. 21, 3089-3092 (1993).

57. Adams, M.D. et al. Complementary DNA sequencing: expressed sequence
tags and human genome project. Science 252, 1651-1656 (1991).

58. Sikela, J.M. & Auffray, C. Finding new genes faster than ever. Nature Genet.
3, 189-191 (1993).

59. Davies, K. The EST express gathers steam. Nature 364, 554 (1993).

60. Boguski, M.S., Lowe, T.M.J. & Tolstoshev, C.M. dbEST - database for
"expressed sequence tags". Nature Genet. 4, 332-333 (1993).

61. Bleasby, A.J. & Wootton, J.C. Construction of validated, non-redundant
composite sequence databases. Protein Eng. 3, 153-159 (1990).

62. Benson, D., Boguski, M., Lipman, D.J. & Ostell, J. The national center for
biotechnology information. Genomics 6, 389-391 (1990).

63. Bairoch, A. & Boeckmann, B. The SWISS-PROT protein sequence data
bank, recent developments. Nucl. Acids Res. 21, 3093-3096 (1993).

64. Henikoff, S. Sequence analysis by electronic mail server. Trends biochem.
Sci. 18, 267-268 (1993).

65. Krol, E. The Whole Internet User's Guide & Catalog (O'Reilly & Assoc. Inc.,
Sebastopol, CA, 1992).

66. Network Entrez. NCBI News 2(2), 1 (National Library of Medicine, Bethesda,
MD, 1993).

67. Wootton, J.C. & Federhen, S. Statistics of local complexity in amino acid
sequences and sequence databases. Comput. Chem. 17, 149-163 (1993).

68. Green, P., Lipman, D., Hillier, L., Waterston, R., States, D.J. & Claverie, J.-
M. Ancient conserved regions in new gene sequences. Science 259, 1711-
1716 (1993).

69. Riggins, G.J. et al. Human genes containing polymorphic trinucleotide
repeats. Nature Genet. 2, 186-191 (1992).

70. Harding, R.M., Boyce, A.J. & Clegg, J.B. The evolution of tandemly repetitive
DNA: recombination rules. Genetics 132, 847-859 (1992).

71. Karlin, S. & Brendel, V. Charge configurations in viral proteins. Proc. natn.
Acad. Sci. U.S.A. 85, 9396-9400 (1988).

72. Karlin, S. & Brendel, V. Chance and statistical significance in protein and DNA
sequence analysis. Science 257, 39-49 (1992).

73. Brendel, V., Bucher, P., Nourbakhsh, I.R., Blaisdell, B.E. & Karlin, S. Methods
and algorithms for statistical analysis of protein sequences. Proc. natn.
Acad. Sci. U.S.A. 89, 2002-2006 (1992).

74. Claverie, J.-M. & States, D.J. Information enhancement methods for large
scale sequence analysis. Comput. Chem. 17, 191-201 (1993).

75. Jurka, J., Walichiewicz, J. & Milosavljevic, A. Prototypic sequences for
human repetitive DNA. J. molec. Evol. 35, 286-291 (1992).

76. Hanks, S.K. & Quinn, A.M. Protein kinase catalytic domain sequence
database: identification of conserved features of primary structure and
classification of family members. Meth. Enzymol. 200, 38-62 (1991).

77. Collins, F. & Galas, D. A new five-year plan for the U.S. human genome
project. Science 262, 43-45 (1993).

78. Gumbel, E.J. Statistics of extremes. (Columbia Univ. Press, New York,
1958).

79. Arratia, R., Gordon, L. & Waterman, M.S. An extreme value theory for
sequence matching. Ann. Stat. 14, 971-993 (1986).

80. Arratia, R., Morris, P. & Waterman, M.S. Stochastic scrabble: large deviations
for sequences with scores. J. appl. Prob. 25, 106-119 (1988).

81. Arratia, R. & Waterman, M.S. The Erdos-Renyi strong law for pattern matching
with a given proportion of mismatches. Ann. Prob. 17, 1152-1169 (1989).

82. Salamon, P. & Konopka, A.K. A maximum entropy principle for distribution
of local complexity in naturally occurring nucleotide sequences. Comput.
Chem. 16, 117-124 (1992).

83. Salamon, P., Wootton, J.C., Konopka, A.K. & Hansen, L. On the robustness
of maximum entropy relationships for complexity distributions of nucleotide
sequences. Comput. Chem. 17, 135-148 (1993).

84. Miyoshi, H. et al. The t(8;21) translocation in acute myeloid leukemia results
in production of an AML1-MTG8 fusion transcript. EMBO J. 12, 2715-2721
(1993).

85. Kokubo, T., Gong, D.-W., Roeder, R.G., Horikoshi, M. & Nakatani, Y. The
Drosophila 110-kDa TFIID subunit directly interacts with the N-terminal
region of the 230-kDa subunit. Proc. natn. Acad. Sci. U.S.A. 90, 5895-5900
(1993).

86. Hoey, T. et al. Molecular cloning and functional analysis of Drosophila
TAF110 reveal properties expected of coactivators. Cell 72, 247-260 (1993).

87. Owens, G.P., Hahn, W.E. & Cohen, J.J. Identification of mRNAs associated
with programmed cell death in immature thymocytes. Mol. cell. Biol. 11,
4177-4188 (1991).

88. Schwabe, J.W., Neuhaus, D. & Rhodes, D. Solution structure of the DNA-
binding domain of the oestrogen receptor. Nature 348, 458-461 (1990).

89. Feig, L.A. The many roads that lead to Ras. Science 260, 767-768 (1993).

90. McCormick, F. How receptors turn Ras on. Nature 363, 15-16 (1993).

91. Boguski, M.S. & McCormick, F. Proteins regulating Ras and its relatives.
Nature 366, 643-654 (1993).

92. Rozakis-Adcock, M., Fernley, R., Wade, J., Pawson, T. & Bowtell, D. The
SH2 and SH3 domains of mammalian Grb2 couple the EGF receptor to the
Ras activator mSos1. Nature 363, 83-85 (1993).

93. Musacchio, A., Gibson, T., Rice, P., Thompson, J. & Saraste, M. The PH
domain is a common piece in the structural patchwork of signalling (and
other) proteins. Trends biochem. Sci. 18, 343-348 (1993).

94. Arents, G., Burlingame, R.W., Wang, B.C., Love, W.E. & Moudrianakis, E.N.
The nucleosomal core histone octamer at 3.1 Å resolution: a tripartite protein
assembly and a left-handed superhelix. Proc. natn. Acad. Sci. U.S.A. 88,
10148-10152 (1991).






Repeats in genomic DNA: mining and meaning 

Jerzy Jurka 



For hundreds of millions of years, perhaps from the very 
beginning of their evolutionary history, eukaryotic cells have 
been habitats and junkyards for countless generations of 
transposable elements, preserved in repetitive DNA 
sequences. Analysis of these sequences, combined with 
experimental research, reveals a history of complex 
'intracellular ecosystems' of transposable elements that are
inseparably associated with genomic evolution.

Addresses 

Genetic Information Research Institute, 1170 Morse Avenue,
Sunnyvale, CA 94089, USA; e-mail: jurka@charon.girinst.org

Current Opinion in Structural Biology 1998, 8:333-337

http://biomednet.com/elecref/0959440X00800333

© Current Biology Ltd ISSN 0959-440X

Abbreviations 

L1-EN endonucleolytic domain in L1 reverse transcriptase

LINE long interspersed nuclear element

LTR long terminal repeat

MIR mammalian-wide interspersed repeat

SINE short interspersed nuclear element 

TE transposable element 

TSD target site duplication 

Introduction 

Repetitive DNA is a major component of eukaryotic
genomes. Understanding its origin, evolution, and genetic
impact upon the host DNA is therefore of fundamental
importance for genome studies. There are two major
groups of repeats in eukaryotic genomes: tandemly repeated
satellites, usually confined to specific chromosomal
regions; and the repeats interspersed with genomic DNA
that are the major focus of this review. Interspersed
repeats represent mostly inactive copies of a wide variety
of contemporarily and historically active transposable
elements (TEs) such as retroelements and DNA transposons,
which can each be further subdivided into distinct
classes. Repetitive sequences have been recruited as
functional components of eukaryotic genomes, which
documents their contribution to genomic evolution [2-6].
They are also an important source of knowledge about the
biology of active TEs. The emerging picture, bolstered by
recent research, is that TEs are not merely 'parasites'.
Rather, they are integral players in genomic evolution,
showing either a 'selfish' or an 'altruistic' nature, depending
on different evolutionary circumstances.

Reconstruction and analysis of repetitive DNA 

As stated above, interspersed repetitive sequences represent
inactive (pseudogene) copies of historically or contemporarily
active TEs. The study of a new TE usually begins
with the identification of its repeated copies, followed by
sequence alignment, classification into subfamilies (if
applicable) and construction of consensus sequences [7].
Apart from the original TEs themselves, consensus
sequences represent the best available approximations of
the original active TEs that generated the repeats. Figure 1
illustrates the relationship between the similarities of
individual repeats to perfect consensus sequences as compared
to similarities between repeats themselves [7]. According to
Figure 1, repeats 37-52% similar to each other will be
55-70% similar to their perfect consensus sequences.
Without such improvement in similarities, the search for
diverse repeats and other biologically meaningful sequence
comparisons may be counterproductive.
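
The consensus-building step can be sketched as a simple majority rule over aligned copies. This is a minimal illustration, not the procedure used for Repbase Update: the function name and the toy copies are hypothetical, the copies are assumed to be already aligned to equal length, and subfamily classification is left out.

```python
from collections import Counter


def majority_consensus(aligned_copies, gap="-"):
    """Majority-rule consensus of pre-aligned, equal-length copies.

    Gap characters are ignored when voting, and columns that contain
    only gaps are dropped. Ties go to the residue seen first.
    """
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("copies must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_copies):
        residues = [base for base in column if base != gap]
        if residues:
            consensus.append(Counter(residues).most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical diverged copies of one source element, each carrying
# a single independent substitution.
copies = ["ACGTACGT", "ACGAACGT", "TCGTACGT", "ACGTACCT"]
print(majority_consensus(copies))
```

Because the substitutions fall at different positions in different copies, the column-wise vote recovers a sequence that is closer to every copy than the copies are to one another.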

Figure 1 (plot of y against x; both variables are defined in the caption)


The similarities between a source gene and its repeats as a function of 
the similarities between the repeats. The x variable indicates the 
average similarity between repeats sharing a common source gene: y 
represents the average similarity of repeats to their source gene that 
can be approximated by a consensus sequence. For example, repeats 
that are on average 50% similar to each other will be >68% similar to
their ideal consensus sequence. Adapted with permission from [7].
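
The caption does not state a formula for the curve. One simple model that reproduces the quoted figures assumes that each copy independently keeps the source base with probability y and otherwise changes to one of the three other bases with equal probability, so that two copies agree with probability x = y^2 + (1 - y)^2 / 3. The sketch below inverts that relation; the model and the function name are assumptions, not taken from the source.

```python
import math


def similarity_to_consensus(x):
    """Estimate y (similarity to source) from x (similarity between copies).

    Inverts x = y**2 + (1 - y)**2 / 3, an assumed uniform-substitution
    model, giving y = (1 + sqrt(12*x - 3)) / 4. Valid for 0.25 <= x <= 1,
    where x = 0.25 is the chance level for four bases.
    """
    if not 0.25 <= x <= 1.0:
        raise ValueError("x must lie between 0.25 and 1")
    return (1.0 + math.sqrt(12.0 * x - 3.0)) / 4.0


for x in (0.37, 0.50, 0.52):
    print(f"x = {x:.2f}  ->  y = {similarity_to_consensus(x):.3f}")
```

Under this assumed model, copies that are 50% similar to each other come out a little over 68% similar to their source, which is consistent with the example in the caption.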



One can reconstruct ancestral TEs even with limited
sequence data, especially if individual copies are not very
diverse. Additional information may be taken into account,
such as the high mutability of CpG dinucleotides or the
presence of open reading frames in which nonsense mutations
can be reversed. This has been dramatically demonstrated
for the Tc1-like DNA transposon from fish, named
Sleeping Beauty, whose transposase was reconstructed from
a dozen inactive copies. Its activity has been demonstrated
not only in the fish from which it originated, but also in
human HeLa cells [8••]. This work, and an earlier study






demonstrating the transfer of a mariner element from
Drosophila to Leishmania [9••], are important steps towards
application of DNA transposons in genomic studies.

Reconstructions of TEs are very labor intensive and
require biological insight but they often remain unpublished.
In order to promote the dissemination of this information
and to credit the individual effort that goes into
producing it, a new electronic publication entitled
Repbase Update was established [10•]. Repbase Update
represents a systematic attempt to integrate consensus
sequence data, nomenclature, biological classification and
other relevant information into a coherent resource necessary
for sequence studies. To date, over 950 different
repeat (sub)families have been
compiled from all available eukaryotic sequence data (see
Table 1). Of these, over 800 are interspersed repeats. Most
interspersed repeats from vertebrates and plants (~80%)
have been assigned to one of the following major categories:
non-long terminal repeat (LTR) retrotransposons or
retroposons also known as SINEs and LINEs, and LTR
retrotransposons including retroviruses and DNA transposons.
The remaining nonplant, nonvertebrate repeats
come from very diverse species, ranging from protozoans
to octopuses, and are temporarily collected under the arbitrary
name of 'invertebrates'. In this group, the fraction of
interspersed repeats assigned to a particular category is
significantly lower (30-40%), mostly due to insufficient
comparative sequence data necessary for the construction of
reliable consensus sequences. This group of repeats is
expected to hold many 'missing links' in our understanding
of the origin and evolution of TEs.

Human and rodent sequences can be screened against the
most recent version of Repbase Update using public
servers [11,12]. Repeat annotation and masking is recommended
prior to exon identification [13,14] but Repbase
Table 1

The current content of Repbase Update.

Type of repeats                    File name     Number of (sub)families

Human repeats                      humrep.ref    284
Alu subfamilies (primate)          humsub.ref    16
Processed pseudogenes (human)      pseudo.ref    20
Rodent repeats                     rodrep.ref    157
Other mammalian repeats            mamrep.ref    96
Other vertebrate repeats           vrtrep.ref    74
Plant repeats                      plnrep.ref    87
Invertebrate repeats               invrep.ref    222
Simple repeats (microsatellites)   simple.ref    131

Total                                            1087
Unique                                           956

Updated human and rodent collections are also available from public
servers for the automatic annotation of DNA sequences [11,12]. Recently
computed proportions of repeats in the nonredundant human sequence
data are as follows: Alu (12.3%); LINE1 (11.9%); MIR (1.6%); LINE2
(2.1%); LTR retrotransposons and endogenous retroviruses (5.6%); DNA
transposons (1.8%); simple repeats (1.4%); other ~0.35%.



Update is increasingly being used for the direct studies of
repetitive DNA.

The genomic fossil record 

The genomic fossil record of past retropositions can be of
great value not only for studies of TEs themselves, but also
for population and phylogenetic studies of their hosts. For
example, young Alu (SINE) subfamilies have been useful
for human population studies. To date, there are five
known Alu subfamilies (Ya1, Ya5, Yb8, Ya8 and Yb9) actively
proliferating in humans [10,15]. Recent innovative studies
of 57 Ya5 Alu sequences, 13 of which are polymorphic
in the human gene pool, led to an estimate of human effective
population size using coalescence theory [16••]. This is

rK.. - : : t' u... 1..-:-.- i: - 

based on Alu retroposition.

Turning to older short interspersed nuclear element
(SINE) families in mammals, Okada's group [17••]
obtained a phylogenetic resolution of the long disputed
relationship among whales, ruminants, hippopotamuses
and pigs. They have shown that two SINE families, called
CHR-1 and CHR-2, are present exclusively in the
genomes of whales, ruminants and hippopotamuses, which
together form a monophyletic group distinct from that of
pigs and camels. This finding contradicts previous phylogenies
and illustrates the powerful use of the genomic fossil
record in complementing the paleontological record
which is particularly difficult to obtain for whales.

Another whale-related development was the identification
of homology between the basic units of common satellites
and L1 elements, representing the most abundant LINE
elements in mammals [18*]. Satellites have long been
viewed as a product of unequal crossing over; however,
there is no evidence that they can originate de novo from
nonfunctional 'junk' DNA. The homology between L1
and these satellites supports this scenario and raises many
interesting questions about satellite and genomic evolution.
Another interesting link between satellites and TEs
is the homology between the centromere-associated protein
(CENP-B) and the pogo family of TEs, although biological
interpretation of this fact remains tentative [19,20].

Retro(trans)position: a continuation of the
transition from the RNA to the DNA world?

Very little is known about the origin of TEs but it is
conceivable that the 'TE world' can be traced all the way back
to the beginning of the transition from the hypothetical
RNA-based genome to the DNA-based one. From this
point of view, the entire genomic DNA might have evolved
with close participation of TEs, starting with retroposon-like
elements. Many TEs might have evolved into parasites,
particularly those that can migrate between different hosts, but
some may still retain their original properties as 'genome
builders'. The examples of Drosophila non-LTR retroposons
HeT-A and TART, which maintain telomeres in Drosophila
[21**,22], combined with the recently reported homology



18069302 

Repeats in genomic DNA: mining and meaning Jurka 335 



between telomerases and reverse transcriptases [23**,24**],
bring us closer to this broad perspective [25].

In this context, it may be worthwhile to revisit recent
research on the extensively studied mammalian L1
(LINE1) elements. The origin of active mammalian L1
elements remains obscure, but they have produced a
succession of numerous subfamilies during the past 100 million
years or so [26], and they continue to be active at least
in humans and rodents [27*,28]. In spite of their assumed
'selfishness', L1 elements seem to exhibit some remnants
of 'altruistic' features that are compatible with active
participation in genome evolution. They are responsible for
adding over 24% of the DNA to the human genome, only
about half of which is L1 DNA (see legend of Table 1 and
[12]). Unlike other LINE elements that are parasitized by
SINEs homologous to their 3' ends [29], L1s apparently
retropose a large variety of SINE elements and mRNAs
(see below) that have no obvious structural relationship
to their own RNA, with the possible exception of
poly(A) tails [31]. This is consistent with a recent study
demonstrating the ability of L1 reverse transcriptase to
efficiently generate cDNA from RNA with no sequence
specificity and including transcripts from cellular genes [32*].
Even the affinity of L1 reverse transcriptase for polyadenylated
RNA hanging around the ribosomal system [31] may
be interpreted as a remnant of the original participation of
L1 predecessors in the retroposition of protein encoding
RNA. Another relevant property may be the ability of L1
reverse transcriptase to heal chromosomal breaks, although
there is some debate as to whether this cannot be attributed
to nonhomologous recombination events [33,34].

Diversity and co-evolution of TEs 

The genomic fossil record deposited in eukaryotic
genomes shows that autonomous TEs tend to be accompanied
by nonautonomous companions that are unable to
proliferate themselves. Examples include transposon deletion
fragments [35,36], SINE elements homologous to 3'
ends of LINE elements [29], and defective LTR
retrotransposons, including defective endogenous retroviruses.
To multiply, the first group must be able to use transposase
from intact DNA transposons, SINE proliferation depends
on LINE-encoded reverse transcriptase and the remaining
retroelements probably rely on intact viruses for their
reproduction. There may be a delicate balance between
the autonomous and nonautonomous groups of TEs,
analogous to the balance between species in complex
ecosystems. Autonomous elements proliferating out of control
may destroy their hosts. Nonautonomous elements may
destroy themselves by 'successful' competition for the
reverse transcriptase or transposase produced by the
autonomous TEs. Transposase titration by defective
transposons has been discussed among possible factors for the
restriction of the activity of mariner-like transposable
elements in natural populations [36], although more specialized
mechanisms, such as overproduction inhibition, and
missense mutation effects are viewed as more prominent
events in limiting proliferation of DNA transposons.
Multiple LINE1 and SINE (Alu, B1, B2, BC1, etc.) subfamilies
in mammals may be viewed as examples of the
ongoing co-evolution that is driven by competition for
reverse transcriptase [26,30**,37]. LINE2 and mammalian-wide
interspersed repeat (MIR) elements [12] might have
become extinct as a result of similar competition. Among
general mechanisms for the restriction of TEs on the
genomic side, suppression by CpG methylation and
heterochromatinization have recently been discussed [4,38,39].
Overall, our knowledge of the mechanisms controlling
TEs at the genomic level is still fragmentary [40].

Co-evolution between autonomous and nonautonomous
elements may not be sufficient to account for the diversity
of endogenous retroviruses and retroviral-like elements in
mammals. Almost half of all the human repetitive elements
deposited in Repbase Update [10*] are either diverse LTRs
or fragments of viruses and LTR retrotransposons, although
they represent less than 6% of the human genome (see
legend of Table 1). In this context, it is worth mentioning a
renewed interest in co-evolution between endogenous and
exogenous retroviruses that could benefit the host [41,42].
Other related possibilities include recurrent infections and
recombinations between distantly related viruses (VV
Kapitonov and J Jurka, unpublished data).

Targeting the mammalian genome 

Sequence analysis of target site duplications (TSDs) of
retroposed elements from mammals [30**], combined with the
independent discovery of the endonucleolytic domain in L1
reverse transcriptase (L1-EN, reviewed in [31]), brought
about a recent breakthrough in our understanding of
retroposon integration in mammals. The consensus sequence of
TSDs and adjacent regions for L1, Alu, ID (BC1), B1, B2,
and processed pseudogenes is TT|AAAA[illegible]N|R,
where R denotes purines, Y represents pyrimidines and N is
any base. The vertical bars show predicted positions of
breakpoints on the opposite strands of double-stranded
DNA [30**,37]. TT|AAAA resembles the consensus sequence
nicked by the L1-EN [43**], an additional argument
implicating L1 reverse transcriptase in the retroposition of
nonautonomous retroposons. The general consensus sequence of
the TSDs may combine different subclasses of targets. For
example, targets beginning with [sequence illegible] are longer on
average than the targets beginning with TT|AAAA (J Jurka,
unpublished data). Different target preferences may be
related to different active L1s [27*].
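The consensus notation used above (R for purines, Y for pyrimidines, N for any base, vertical bars for breakpoints) maps directly onto a regular expression. The following is a minimal sketch of that mapping, not the method used in the cited studies; the function names and the example patterns and sequences are invented for illustration.

```python
import re

# Character classes for the nucleotide codes named in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purines
    "Y": "[CT]",    # pyrimidines
    "N": "[ACGT]",  # any base
}


def consensus_to_regex(consensus):
    """Translate a consensus string into a regex source string.

    The '|' characters only mark predicted breakpoints, so they are
    skipped when building the pattern.
    """
    return "".join(IUPAC[base] for base in consensus.upper() if base != "|")


def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, overlaps included."""
    # A lookahead keeps overlapping matches from being skipped.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Illustrative target hexamer with a breakpoint mark (assumed example).
print(find_sites("TT|AAAA", "GGTTAAAACCTTAAAAG"))
# Hypothetical pattern mixing Y, N and R codes.
print(find_sites("TYTN|R", "ATCTGACC"))
```

Stripping the breakpoint marks before matching keeps the pattern a plain sequence motif; the breakpoint offset can be recovered separately from the position of the bar in the consensus string.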

The conserved sequences around both breakpoints in the
consensus sequence given above appear to be different from
each other, but separate analyses indicate that both
sequences are enriched with kinkable TA, CA and TG
dinucleotide steps, which suggests a similar mechanism by
which both breaks are generated [44*]. This mechanism may
be of general significance since the kinkable dinucleotides
are conserved in targets both for DNA transposons and for
insertion elements in bacteria [44*].
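Enrichment for kinkable steps can be illustrated by counting overlapping dinucleotides and taking the share that are TA, CA or TG. This is a rough sketch under simple assumptions (single strand, no background correction), not the analysis performed in the cited work; the function names are hypothetical.

```python
from collections import Counter

# Dinucleotide steps described in the text as kinkable.
KINKABLE_STEPS = ("TA", "CA", "TG")


def dinucleotide_counts(sequence):
    """Count every overlapping dinucleotide step in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def kinkable_fraction(sequence):
    """Fraction of dinucleotide steps that are TA, CA or TG."""
    counts = dinucleotide_counts(sequence)
    total = sum(counts.values())
    if total == 0:
        # Sequences shorter than two bases have no steps at all.
        return 0.0
    return sum(counts[step] for step in KINKABLE_STEPS) / total


# Steps in "TATGCA" are TA, AT, TG, GC, CA: three of five are kinkable.
print(kinkable_fraction("TATGCA"))
```

A real comparison would set this fraction against the value expected from the base composition of the flanking sequence, which this sketch leaves out.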






336 Sequences and topology 



In analogy to the model of integration of insect R2 non-LTR
retroposon [45], the reverse transcription of mammalian
retroposons may be primed by the 3' DNA ends
exposed by nicking. Although self-priming of retroposable
RNA has been recently demonstrated in vitro [46], its role
in the retroposition of mammalian retroposons may be
marginal if any.

It has long been known that double-stranded breaks
stimulate homologous recombination. Therefore, DNA targets
exposed to L1-EN nicking activity may be recombinational
hot spots in mammalian genomes. This may have implications
for the understanding of at least some of the fragile
chromosomal sites involved in the origin of genetic diseases.

Conclusions 

The reverse flow of information from RNA to DNA might
have had a definite beginning in the history of life, but it has
never ended. It remains an integral part of the ongoing
genomic evolution in eukaryotic species. It is manifested in
active retroposons and in their fossil record as interspersed
repetitive DNA. These are the major conclusions emerging
from recent progress in the field. Based on these conclusions,
the one-dimensional interpretation of TEs as 'parasites' or
'selfish' elements should be transformed into a more
balanced view, with their diverse roles comparable to the
biological roles of individual species in evolving ecosystems. As
the diverse world of TEs continues to emerge with new
sequence data, TEs are increasingly being explored in a
broad range of biological problems, from phylogenetic and
population studies to genome engineering.

Acknowledgements 

Many outstanding and relevant contributions prior to 1997 could not be
reviewed here. I selected a number of [illegible]
for this deficiency. I would like to thank Vladimir Kapitonov, Paul
Klonowski, Dorothy Munro and Jolanta Walichiewicz for help with editing
this manuscript. This work was supported by the National Institutes of
Health grant 1 P41 [number illegible].

References and recommended reading 

Papers of particular interest, published within the annual period of review,
have been highlighted as:

• of special interest
•• of outstanding interest

1. Capy P: Classification of transposable elements. In Molecular
Biology Intelligence Unit: Dynamics and Evolution of Transposable
Elements. Edited by Capy P, Bazin C, Higuet D, Langin T.
Georgetown, Texas: Landes Bioscience; 1998:37-52.

2. Brosius J, Tiedge H: Reverse transcriptase: mediator of genomic
plasticity. Virus Genes 1996, 11:163-179.

3. Levin HL: It's prime time for reverse transcriptase. Cell 1997, 88:5-8.

4. Kidwell MG, Lisch D: Transposable elements as sources of
variation in animals and plants. Proc Natl Acad Sci USA 1997,
94:7704-7711.

5. Tomilin NV: Control of genes by mammalian retroposons. Int Rev
Cytol 1996, in press.

6. Chu WM, Ballard R, Carpick BW, Williams BR, Schmid CW:
Potential Alu function: regulation of the activity of double-stranded
RNA-activated kinase PKR. Mol Cell Biol 1998, 18:58-68.

7. Jurka J: Approaches to identification and analysis of interspersed
repetitive DNA sequences. In Automated DNA sequencing and
analysis. Edited by Adams MD, Fields C, Venter JC. San Diego:
Academic Press Incorporated; 1994:294-298.

8. Ivics Z, Hackett PB, Plasterk RH, Izsvak Z: Molecular reconstruction
•• of Sleeping Beauty, a Tc1-like transposon from fish, and its
transposition in human cells. Cell 1997, 91:501-510.
This important work is about the reconstruction of an active transposase from
12 pseudogenes found in eight different fish species and using a modified
consensus sequence. The approach used has implications for the
reconstruction of other proteins involved in proliferation of transposable elements,
for the engineering of new transposable elements, and for genome studies.

9. Gueiros-Filho FJ, Beverley SM: Trans-kingdom transposition of the
•• Drosophila element mariner within the protozoan Leishmania.
Science 1997, 276:1716-1719.
The authors demonstrate the efficient transfer of the Drosophila mauritiana
mariner element into the human parasite Leishmania major. This, and recent
experiments with a reconstructed transposase [8**], clearly demonstrate the
feasibility of genetic studies on a wide variety of species using DNA
transposable elements.

10. Repbase Update 1997 on World Wide Web URL:
• http://www.girinst.org/~server/repbase.html

This is a collective attempt to organize the explosively growing number and
variety of repetitive sequences. Repbase Update includes many consensus
sequences of transposable elements and their biological characterization
that are unreported anywhere else.

11. Genetic Information Research Institute on the World Wide Web URL:
http://charon.girinst.org

12. Smit AFA: The origin of interspersed repeats in the human
genome. Curr Opin Genet Dev 1996, 6:743-748.

13. Burge C, Karlin S: Prediction of complete gene structures in
human genomic DNA. J Mol Biol 1997, 268:78-94.

14. Claverie JM: Computational methods for the identification of
genes in vertebrate genomic sequences. Hum Mol Genet 1997,
6:1735-1744.

15. Mighell AJ, Markham AF, Robinson PA: Alu sequences. FEBS Lett
1997, 417:1-5.

16. Sherry ST, Harpending HC, Batzer MA, Stoneking M: Alu evolution in
• human populations: using the coalescent to estimate effective
population size. Genetics 1997, 147:1977-1982.

This paper demonstrates a very interesting application of Alu polymorphism for
estimating human effective population size during the last 1-2 million years.

17. Shimamura M, Yasue H, Ohshima K, Abe H, Kato H, Kishiro T, Goto
•• M, Munechika I, Okada N: Molecular evidence from retroposons
that whales form a clade within even-toed ungulates. Nature 1997,
388:666-670.

This paper addresses an important phylogenetic problem by innovative
exploitation of selected repetitive sequences. This is a powerful example of
how the genomic fossil record for some species can be more informative
than the paleontological record.

18. Kapitonov V, Holmquist G, Jurka J: L1 repeat is a basic unit of
• heterochromatin satellites in cetaceans. Mol Biol Evol 1998,
15:611-612.

This work has important implications for the understanding of the origin and
evolution of satellite DNA.

19. Halverson D, Baum M, Stryker J, Carbon J, Clarke L: A centromere
DNA-binding protein from fission yeast affects chromosome
segregation and has homology to human CENP-B. J Cell Biol
1997, 136:487-500.

20. Kipling D, Warburton PE: Centromeres, CENP-B and Tigger too.
Trends Genet 1997, 13:141-145.

21. Danilevskaya ON, Arkhipova IR, Traverse KL, Pardue ML: Promoting
•• in tandem: the promoter for telomere transposon HeT-A and
implications for the evolution of retroviral LTRs. Cell 1997,
88:647-655.

This work shows that promoter activity in the retroposon HeT-A is located at
its 3' end, in contrast to other retroposons. Tandemly arranged HeT-A
elements share these 3' promoters with their downstream neighbors. The
authors conclude that, because of its unusual structure, HeT-A resembles
an evolutionary intermediate between non-LTR and LTR retrotransposons.

22. Pardue ML, Danilevskaya ON, Traverse KL, Lowenhaupt K:
Evolutionary links between telomeres and transposable elements.
Genetica 1997, 100:73-84.

23. Lingner J, Hughes TR, Shevchenko A, Mann M, Lundblad V, Cech TR:
•• Reverse transcriptase motifs in the catalytic subunit of
telomerase. Science 1997, 276:561-567.






Telomerase catalytic subunits were first identified in Euplotes aediculatus
and Saccharomyces cerevisiae, and were shown to contain reverse
transcriptase motifs. This paper further demonstrates the fact that the
reverse transcriptase motif is essential for normal chromosome telomere
replication. This work brings together retroposition and chromosome
maintenance and has profound evolutionary implications.

24. Nakamura TM, Morin GB, Chapman KB, Weinrich SL, Andrews WH,
•• Lingner J, Harley CB, Cech TR: Telomerase catalytic subunit
homologs from fission yeast and human. Science 1997,
277:955-959.

This paper reveals that the catalytic subunits of telomerases [23**] have
conserved domains common to all reverse transcriptases. These domains
also revealed distinct hallmarks and the authors conclude that they
represent a deep branch in the evolution of reverse transcriptases, and
perhaps originated with the first eukaryote.

25. Eickbush TH: Telomerase and retrotransposons: which came first?
Science 1997, 277:911-912.

26. Smit AFA, Toth G, Riggs AD, Jurka J: Ancestral mammalian-wide
subfamilies of LINE-1 repetitive sequences. J Mol Biol 1995,
246:401-417.

27. Sassaman DM, Dombroski BA, Moran JV, Kimberland ML, Naas TP,
• DeBerardinis RJ, Gabriel A, Swergold GD, Kazazian HH Jr: Many
human L1 elements are capable of retrotransposition. Nat Genet
1997, 16:37-43.

This paper estimates the number of active L1 copies in the human genome.
Different L1s may account for the presence of different targets for
retroposon integration, as discussed in the review.

28. Naas TP, DeBerardinis RJ, Moran JV, Ostertag EM, Kingsmore SF,
Seldin MF, Hayashizaki Y, Martin SL, Kazazian HH Jr: An actively
retrotransposing, novel subfamily of mouse L1 elements. EMBO J
1998, 17:590-597.

29. Okada N, Hamada M, Ogiwara I, Ohshima K: SINEs and LINEs
share common sequences: a review. Gene 1997, 205:229-243.

30. Jurka J: Sequence patterns indicate an enzymatic involvement in
•• integration of mammalian retroposons. Proc Natl Acad Sci USA
1997, 94:1872-1877.

This paper shows for the first time that the integration of SINE, L1 and
processed retropseudogenes occurs at nonrandom, consensus-defined
sequence targets. This strongly links the L1 retroposition machinery to the
proliferation of non-LINE retroposons and has implications for
understanding of the mechanism of retroposition.

31. Boeke JD: LINEs and Alus - the polyA connection. Nat Genet
1997, 16:6-7.

32. Dhellin O, Maestre J, Heidmann T: Functional differences between
• the human LINE retrotransposon and retroviral reverse
transcriptases for in vivo mRNA reverse transcription. EMBO J
1997, 16:6590-6602.

This paper demonstrates the specific and high efficiency of L1 reverse
transcription of RNA that has no sequence specificity. This is compatible
with 'unselfish' aspects of L1 previously discussed in this review.



33. Teng SC, Kim B, Gabriel A: Retrotransposon reverse-transcriptase-
mediated repair of chromosomal breaks. Nature 1996, 383:641-644.

34. Lauermann V: DNA repair by recycling reverse transcripts. Nature
1997, 386:31-32.

35. Vos JC, De Baere I, Plasterk RHA: Transposase is the only
nematode protein required for in vitro transposition of Tc1. Genes
Dev 1996, 10:755-761.

36. Hartl DL, Lozovskaya ER, Nurminsky DI, Lohe AR: What restricts the
activity of mariner-like transposable elements? Trends Genet
1997, 13:197-201.

37. Jurka J, Klonowski P: Integration of retroposable elements in
mammals: selection of target sites. J Mol Evol 1996, 43:685-689.

38. Yoder JA, Walsh CP, Bestor TH: Cytosine methylation and the
ecology of intragenomic parasites. Trends Genet 1997, 13:335-340.

39. Bird A: Does DNA methylation control transposition of selfish
elements in the germline? Trends Genet 1997, 13:469-470.

40. Labrador M, Corces VG: Transposable element-host interactions:
regulation of insertion and excision. Annu Rev Genet 1997,
31:381-404.

41. Van der Kuyl AC: Endogenous retrovirus sequences and their
usefulness to the host. Trends Microbiol 1997, 5:339.

42. Best S, Le Tissier PR, Stoye JP: Endogenous retroviruses and the
evolution of resistance to retroviral infection. Trends Microbiol
1997, 5:313-318.

43. Feng Q, Moran JV, Kazazian HH Jr, Boeke JD: Human L1
•• retrotransposon encodes a conserved endonuclease required for
retrotransposition. Cell 1996, 87:905-916.

This breakthrough paper demonstrates the presence of an endonucleolytic
domain in L1-encoded reverse transcriptase, implying that reverse
transcription in mammals is primed by the 3' DNA ends that are exposed by
nicking, as previously established in insects [45].

44. Jurka J, Klonowski P, Trifonov EN: Mammalian retroposons integrate
• at kinkable DNA sites. J Biomol Struct Dyn 1998, 15:717-721.
Sequence data indicate that the integration of retroposons and other TEs
may be associated with the formation of DNA kinks. This suggests the
presence of universal structural features associated with the integration
of TEs.

45. Luan DD, Korman MH, Jakubczak JL, Eickbush TH: Reverse
transcription of R2Bm RNA is primed by a nick at the
chromosomal target site: a mechanism for non-LTR
retrotransposition. Cell 1993, 72:595-605.

46. Shen MR, Brosius J, Deininger PL: BC1 RNA, the transcript from a
master gene for ID element amplification, is able to prime its own
reverse transcription. Nucleic Acids Res 1997, 25:1641-1648.



Genomics 58, 29-33 (1999) 

Article ID geno.1999.5810, available online at http://www.idealibrary.com on IDEAL



Localization of Retina/Pineal-Expressed Sequences: Identification 
of Novel Candidate Genes for Inherited Retinal Disorders 

Melanie M. Sohocki,* Kimberly A. Malone,* Lori S. Sullivan,*,† and Stephen P. Daiger*,†,1

* Human Genetics Center, School of Public Health and † Department of Ophthalmology and Visual Science,
The University of Texas Health Science Center, Houston, Texas 77225-0334

Received December 10, 1998; accepted March 2, 1999 



More than 100 genes causing inherited retinal diseases
have been mapped to chromosomal locations,
but less than half of these genes have been cloned.
Mutations in many retina/pineal-specific genes are
known to cause inherited retinal diseases. Examples
include mutations in arrestin, rhodopsin kinase, and
the cone-rod homeobox gene, CRX. To identify additional
candidate genes for inherited retinal disorders,
novel retina/pineal-expressed EST clusters were identified
from the TIGR Human Gene Index database and
mapped to specific chromosomal sites. After known
human gene sequences were excluded, and repeat sequences
were masked, 26 novel retina and pineal
gland cDNA clusters were identified. The retinal expression
of each novel EST cluster was confirmed by
PCR assay of a retinal cDNA library, and each cluster
was localized in the genome using the GeneBridge 4.0
radiation hybrid panel. In silico expression data from
the TIGR database suggest that these EST clusters are
retina/pineal-specific or predominantly expressed in
these tissues. This combination of database analysis
and laboratory investigation has localized several EST
clusters that are potential candidates for genes causing
inherited retinopathy. © 1999 Academic Press



INTRODUCTION 

Although more than 100 genes causing inherited 
retinal diseases have been mapped to chromosomal 
locations, less than half of these genes have been 
cloned (RetNet, http://www.sph.uth.tmc.edu/RetNet). 
Many of the mutations leading to inherited retinal 
disorders have been identified in genes that are ex- 
pressed predominantly in the retina and pineal gland. 
Photoreceptors and pinealocytes are developmentally 
related and also share expression of many genes in- 

Sequence data from this article have been deposited with the 
EMBL/GenBank Data Libraries under Accession Nos. G42173- 
G42198. 

1 To whom correspondence should be addressed at Human Genetics
Center, The University of Texas Health Science Center, P.O. Box
20334, Houston, TX 77225-0334. Telephone: (713) 500-9829. Fax:
(713) 500-0900.



volved in phototransduction. Therefore, novel genes 
with expression patterns limited to these two tissues 
are potential candidates for inherited retinal disorders. 

The retina and pineal gland both originate embryologically
from the most anterior region of the neural
plate, the diencephalon (Gilbert, 1994). Development
and differentiation of these organs are also related, as
many of the same developmental genes, such as the
homeobox genes Xrx1 (Casarosa et al., 1997) and Crx
(Chen et al., 1997), have expression patterns limited to
the developing retina and pineal gland. Furthermore,
mammalian pinealocytes are evolutionarily related to
photoreceptor cells (Vollrath, 1985) and express a selective
group of "retinal proteins" that are involved
in the phototransduction cascade, such as rhodopsin
kinase, phosphodiesterase, and transducin (Lolley et
al., 1992). Neonatal pinealocytes express both "rod-specific"
and "cone-specific" phototransduction components,
and different subtypes of pinealocytes may express
varying combinations of these phototransduction
enzymes, similar to the different subtypes of photoreceptors
in the retina (Blackshaw and Snyder, 1997).
Inherited retinal diseases have been associated with
mutations in retina and pineal gland transcription factor
genes, such as the cone-rod homeobox gene CRX
(Freund et al., 1997, 1998; Sohocki et al., 1998; Swain
et al., 1998; Swain et al., 1997), as well as in genes
involved in the phototransduction cascade, such as arrestin
(Fuchs et al., 1995; Nakazawa et al., 1998; Wada
et al., 1996) and rhodopsin kinase (Khani et al., 1998;
Yamamoto et al., 1997).

The goal of this study was to identify novel retina/pineal-specific
EST clusters as potential candidate
genes for inherited retinal disorders using a combination
of database analysis and laboratory investigation.
Expressed sequence tags (ESTs) are partial cDNA sequences
that are being identified from tissue-specific
cDNA libraries by large human genome centers and
are deposited into databases, such as GenBank dbEST.
The TIGR Human Gene Index database
(http://www.tigr.org/tdb/hgi/hgi.html) lists assembled clusters of
ESTs, which usually arise from the same transcript,
and organizes these clusters according to tissue expression.
We identified EST clusters expressed in the retina
and pineal gland from the TIGR database, eliminated
clusters that are expressed in additional tissues
or represent known genes, and eliminated clusters
composed of repeat sequences only. PCR primers were
designed to the remaining 26 clusters and used to
confirm retinal expression as well as to localize the
gene encoding each EST cluster within the genome. At
least 7 of the retina and pineal gland expressed genes
identified in this study localize within the minimal
candidate region of mapped inherited retinal diseases.

0888-7543/99 $30.00
Copyright © 1999 by Academic Press
All rights of reproduction in any form reserved.

MATERIALS AND METHODS 

Identification of retina and pineal gland clusters. The TIGR Human
Gene Index database release version 3.3 was searched in
July 1998 (day illegible) for EST clusters with at least 10% pineal transcripts.
Only clusters including retina and pineal gland ESTs, or including
retina, pineal gland, and brain or cancer tumor ESTs, were studied
further. Any repeat sequences within a cluster were masked
by the RepeatMasker program
(http://ftp.genome.washington.edu/RM/RepeatMasker.html) before BLAST homology searches
were performed (Altschul et al., 1990) using the NCBI server
(http://www.ncbi.nlm.nih.gov/BLAST/). Clusters identified by
BLAST as representing known genes were excluded from the
study, as were clusters identified by BLAST that include ESTs
from tissues other than retina, pineal gland, brain, or cancer
tissues.
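The selection rule described above can be written as a small predicate over the tissue labels of the ESTs in a cluster. This is a minimal sketch under stated assumptions: the data layout (one tissue label per EST), the label strings, and the function name are hypothetical and are not taken from the TIGR database.

```python
# Tissues allowed alongside retina and pineal gland (per the text).
ALLOWED_EXTRA_TISSUES = {"brain", "cancer"}
# Minimum share of pineal transcripts in a cluster (per the text).
MIN_PINEAL_FRACTION = 0.10


def is_candidate_cluster(est_tissues):
    """Return True if a cluster passes the tissue-expression filter.

    est_tissues is a list with one tissue label per EST in the cluster
    (hypothetical layout chosen for this example).
    """
    if not est_tissues:
        return False
    pineal = est_tissues.count("pineal")
    retina = est_tissues.count("retina")
    # At least 10% of the ESTs must come from pineal gland.
    if pineal / len(est_tissues) < MIN_PINEAL_FRACTION:
        return False
    # At least one retina EST is required.
    if retina < 1:
        return False
    # Any remaining ESTs may only come from brain or cancer tissue.
    others = set(est_tissues) - {"pineal", "retina"}
    return others <= ALLOWED_EXTRA_TISSUES


print(is_candidate_cluster(["pineal", "retina", "brain"]))  # passes
print(is_candidate_cluster(["pineal", "retina", "liver"]))  # fails
```

The known-gene and repeat-only exclusions depend on BLAST and RepeatMasker output and are deliberately left out of this sketch.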

Localization of clusters and confirmation of retinal expression.
Localization involved optimization of PCR primers, PCR product
analysis to confirm identity, and radiation hybrid mapping.
PCR primers were designed for an STS of each cluster using the
Primer3 program
(http://www-genome.wi.mit.edu/cgi-bin/primer/primer3.cgi). Primer pairs were optimized for PCR of human
genomic DNA using a standard protocol of 35 cycles with AmpliTaq
Gold polymerase (Perkin-Elmer) and an annealing temperature gradient
generated in a Stratagene Robocycler thermocycler. The resulting
DNA fragments were separated on standard 2% agarose gels. If
the resulting fragment was not of the expected size (indicating either
an intervening intron or the wrong product), the fragment was
treated with shrimp alkaline phosphatase and exonuclease (Amersham),
followed by manual sequencing with the AmpliCycle Sequencing
Kit (Perkin-Elmer) and a primer end-labeled with ^P on a
6% Long Ranger (FMC Bioproducts) denaturing acrylamide gel.
Each cluster was localized in the genome by PCR assay with the
same primers (using optimized conditions) in the GeneBridge 4.0
radiation hybrid panel (Research Genetics). Results were submitted
to the GeneBridge 4.0 mapping server at the Whitehead Institute
(http://carbon.wi.mit.edu:8000/cgi-bin/contig/rhmapper.pl) using a
minimum lod score of 15 for placement. The resulting mapping
information was then compared to the Stanford
(http://www-shgc.stanford.edu/Mapping/index.html) and Whitehead Institute
(http://carbon.wi.mit.edu:8000/cgi-bin/contig/phys_map) radiation hybrid
maps for identification of the chromosomal band containing the gene
encoding the cDNA cluster. Each cluster was then assayed for retinal
expression by PCR in a human retina cDNA library kindly provided
by Dr. Jeremy Nathans (Nathans and Hogness, 1984), followed by
separation of products on a 2% agarose gel. The sequence, PCR
primers, and amplification conditions for each STS developed in this
study are available in GenBank and dbSTS (NCBI) (Table 2).

RESULTS 

Identification of Retina/Pineal cDNA Clusters

Retina and pineal gland cDNA clusters were selected
for mapping by the following strategy. First, all clusters
with 10% or more pineal gland transcripts were
identified in the TIGR Human Gene Index. After duplicate
clusters and clusters from the 5' and 3' ends of
the same clone were eliminated, 1047 clusters or ESTs
remained. The remaining clusters were then scanned
for those with (i) one or more retina-expressed ESTs
but (ii) no ESTs from other tissues, except brain or
cancer cells. Forty-five clusters containing retina and
pineal gland ESTs remained. Twelve of these were
excluded because they were found to represent known
genes (Table 1). The remaining 33 clusters were then
tested by BLAST analysis for highly similar sequences
in the GenBank dbEST database. Four clusters
(THC233355, THC230448, THC201975, and THC229881)
were excluded from further study because EST sequences
from other tissues were identified in dbEST.
In addition, THC224189 was excluded because PCR
primers for its assay could not be designed, as the
majority of its sequence is Alu repeat sequences and
there was no opposite end information for any of the
cDNAs in the cluster. Two clusters, THC133954 and
THC198187, were highly similar by BLAST analysis
to genomic clones with known localizations: THC133954
overlaps with 12PTEL055, which maps to 12p13.3, and
THC198187 overlaps with 425C14, which maps to
6q22. The remaining 26 retina/pineal-specific clusters
were judged to represent novel genes with unknown
localizations.

TABLE 1

Retina/Pineal-Specific THC Clusters Representing Known Genes

THC name   Gene name, MIM(a) No.
78331      Interphotoreceptor retinoid-binding protein, IRBP, 180290
86178      Guanine nucleotide-binding protein, β polypeptide 3, 139130
100760     cGMP phosphodiesterase, β polypeptide, 180072
164291     Torsin B, DYT1, 128100
166839     Paired box homeotic protein 6, PAX6, 106210
172410     Synaptophysin p38, 313475
175189     Recoverin, 179618
175673     Transducin, γ-subunit, 189970
177643     Retinoschisis protein, XLRS1, 312700
213359     Chimaerin, β2 glial fibrillary acidic protein, 602857
215703     Guanylyl cyclase activating protein, GCAP, 600364
216888     Voltage-gated potassium channel, KCNB1, 600397

(a) Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/Omim/).
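The tissue-based selection criteria described above can be illustrated with a minimal Python sketch. It is not the authors' software: the `Cluster` record, its field names, and the example data are hypothetical, and the later dbEST BLAST check and primer-design exclusions are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical record for one THC cluster; field names are illustrative."""
    name: str
    tissue_counts: dict = field(default_factory=dict)  # tissue -> number of ESTs
    known_gene: bool = False


# Tissues tolerated in addition to retina and pineal gland, per the text.
ALLOWED_OTHER = {"brain", "cancer"}


def pineal_fraction(cluster):
    """Fraction of the cluster's ESTs that come from pineal gland."""
    total = sum(cluster.tissue_counts.values())
    return cluster.tissue_counts.get("pineal", 0) / total if total else 0.0


def is_candidate(cluster, min_pineal=0.10):
    """Apply the selection criteria described in the text to one cluster."""
    counts = cluster.tissue_counts
    if pineal_fraction(cluster) < min_pineal:
        return False  # fewer than 10% pineal transcripts
    if counts.get("retina", 0) < 1:
        return False  # needs one or more retina-expressed ESTs
    others = {t for t, n in counts.items() if n > 0} - {"retina", "pineal"}
    if not others <= ALLOWED_OTHER:
        return False  # ESTs from a tissue other than brain or cancer cells
    return not cluster.known_gene  # known genes are set aside (Table 1)


def select_candidates(clusters):
    """Return the names of clusters that pass every filter, in input order."""
    return [c.name for c in clusters if is_candidate(c)]
```

Keeping each criterion as a separate early return makes it easy to see which rule rejects a given cluster.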

Localization of Clusters 

The primer pairs for the STS for each of the 26
clusters were optimized in genomic DNA prior to PCR
assay in the radiation hybrid panel. On optimization,
the STS fragment for THC158983 was much larger
than expected; however, sequencing revealed that the



Localization of retina/pineal ests 31 



TABLE 2

Retina/Pineal-Specific Clusters Mapped in This Study

Laboratory  THC cluster  GenBank        No. of       No. of        Number of          Mapping      Candidate for inherited
ID          name         Accession No.  pineal ESTs  retinal ESTs  other ESTs         location     retinal disorder(a)
MMS01       90422        G42173         3            2             0                  17p13        RP13
MMS02       90997        G42196         1            1             0                  12q24.1
MMS03       133968       G42174         1            1             1 infant brain     9p21
MMS04       137161       G42175         [illegible]  [illegible]   0                  12p13.31
MMS05       137267       G42197         1            1             0                  Xq21-q22
MMS06       153932       G42176         1            1             1 cancer           3q29         OPA1 (Kjer type)
MMS07       154909       G42177         1            [illegible]   0                  11q25
MMS08       157357       G42178         1            1             2 brain, 2 cancer  9q22.3
MMS09       158470       G42179         [illegible]  16            0                  11q13.3      EVR
MMS10       158983       G42195         1            1             0                  9q22.3
MMS11       160180       G42180         1            1             1 cancer           6p23
MMS12       160504       [illegible]    [illegible]  4             [illegible]        [illegible]
MMS13       160521       G42182         1            1             0                  1p22.1
MMS14       163082       G42183         1            2             0                  5q14         WGN1/ERVR
MMS15       174321       G42184         3            4             2 brain            10q22.3
MMS16       177310       G42185         2            2             3 brain            19p13.3
MMS17       177379       G42186         3            4             0                  19q13
MMS18       180397       G42187         5            2             25 brain           2q37
MMS19       195887       G42188         1            1             0                  12q13
MMS20       195934       G42189         1            1             0                  1q31.1       RP12
MMS21       202304       G42190         6            4             23 brain           19q13.4
MMS22       207703       G42191         1            2             2 brain            8p22
MMS23       210727       G42192         1            1             0                  10q26.1
MMS24       220430       G42198         1            2             0                  17p13        RP13
MMS25       229889       G42193         2            2             1 brain            15q24.1      RP, MR
MMS26       229891       G42195         1            2             0                  5q31

(a) Candidates were mapped to published candidate region for these loci. RP13, retinitis pigmentosa 13 locus; OPA1, optic atrophy 1 locus;
EVR, exudative vitreoretinopathy; WGN1, Wagner syndrome; ERVR, erosive vitreoretinopathy; RP, MR refers to recently reported retinitis
pigmentosa with mental retardation locus (Mitchell et al., 1998).



fragment included an intron flanked by coding sequence
that matched the predicted coding sequence for
this cDNA cluster. Table 2 presents the STS name,
number of cDNAs of each type within the cluster, and
chromosomal mapping location for each novel cluster
mapped in this study.

Confirmation of Retinal Expression 

As confirmation of retinal expression, the STS sequences
for each cluster were assayed by PCR in an
adult retina cDNA library. Although some of the sequences,
such as MMS10, MMS13, and MMS26, produced
only a weak amplification, only one sequence,
MMS05, failed to amplify from the library.

DISCUSSION 

Many of the known mutations leading to nonsyn- 
dromic inherited retinal degeneration are located in 
genes with either retina-specific or retina/pineal-spe- 
cific expression patterns. Moreover, these mutations 
are usually found in genes whose expression in the
retina is limited to the photoreceptors, such as rhodopsin,
peripherin, or CRX (Freund et al., 1997). However,
identification of photoreceptor-expressed genes as candidates
for inherited retinal disorders has required
tedious laboratory experiments such as in situ hybrid- 
ization. Because the pinealocytes and photoreceptors 
are developmentally and functionally related, this 
study focused on genes with expression limited to the 
pineal gland and retina, with the expectation that 
many of these will be expressed in photoreceptors. 
cDNA clusters that also included brain cDNAs were 
not excluded from study, as brain transcripts may be of 
pineal origin. In addition, cDNA clusters that also in- 
cluded cancer tissue transcripts were not excluded, 
because tumor cells may express transcripts not ex- 
pressed in the nontransformed tissue. 

Retinal expression of each novel retina and pineal 
gland cDNA cluster in this study was confirmed, with 
the exception of THC137267. The STS for this cluster 
failed to amplify from the retinal cDNA library, but 
amplified from the genomic DNA of the radiation hy- 
brid panel. TIGR lists a single adult retinal cDNA for 
this cluster, and it is likely that the cDNA was not from 
a gene normally transcribed in the retina. 

The cDNA clusters that were identified by this 
method as representing transcripts of known genes are 
described in Table 1. These findings prove the validity 
of this approach for identifying candidate genes for 



32 



SOHOCKI ET AL. 



inherited retinal disorders, because mutations of some
of these genes, such as the retinoschisis protein and
the paired homeobox gene 6 (PAX6), are associated
with inherited retinal diseases.

The 26 novel retina and pineal gland expressed 
"genes" that were identified and mapped in this study 
are shown in Table 2. The term genes is used loosely, as 
it is possible that STSs that map close to one another 
may be from the same gene, for example, MMS01 and
MMS24 on chromosome 17 or MMS08 and MMS10 on
chromosome 9. It is also possible that transcripts from 
tissues other than the retina and pineal gland may be 
identified later for some genes mapped in this study. 
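The caveat about STSs that map close to one another can be checked mechanically. The sketch below is illustrative only: it groups STSs by identical reported band, using mapping locations transcribed from Table 2. The function name and the exact-match comparison are assumptions, and sharing a band does not establish that two STSs come from the same gene.

```python
from collections import defaultdict

# Mapping locations transcribed from Table 2 for the STSs named in the
# text, plus MMS02 as an example of an STS that shares its band with none.
sts_locations = {
    "MMS01": "17p13",
    "MMS24": "17p13",
    "MMS08": "9q22.3",
    "MMS10": "9q22.3",
    "MMS02": "12q24.1",
}


def colocalized(locations):
    """Group STS names by band and keep only bands shared by two or more STSs."""
    groups = defaultdict(list)
    for sts, band in locations.items():
        groups[band].append(sts)
    return {band: sorted(names) for band, names in groups.items() if len(names) > 1}
```

Pairs flagged this way would still need sequence-level comparison before being merged into one gene.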

Seven of the STSs localized in this study fall within 
the published candidate regions for mapped inherited 
retinal diseases as shown in Table 2. The phenotypes of 
these autosomal loci include dominant retinitis pig- 
mentosa (RP13, MIM No. 600059), recessive retinitis 
pigmentosa (RP12, MIM No. 600105), dominant optic
atrophy (OPA1, MIM No. 165500), dominant familial
exudative vitreoretinopathy (EVR, MIM No. 133780),
and dominant Wagner syndrome or erosive vitreoretinopathy
(WGN1/ERVR, MIM No. 143200). In addition,
one of these seven genes mapped within the candidate
region for recessive mental retardation and retinitis
pigmentosa, recently assigned to 15q24 (Mitchell et al.,
1998). Further laboratory investigation, including full-length
cDNA sequencing and genomic characterization,
followed by analysis of DNA samples from affected
family members, will be necessary to determine
whether mutations in any of these candidate genes
cause inherited retinal diseases.

Subsequent to the completion of this study, the latest
GeneMap of the Human Genome was released
(http://www.ncbi.nlm.nih.gov/genemap/). STSs reported in
GeneMap'98 confirm mapping of eight of the genes
mapped in this study. However, GeneMap'98 reports
only brain or pineal transcripts for four of these:
THC180397 (Unigene Hs.4822, brain), THC153932
(WI-18114, pineal gland), THC202304 (Unigene
Hs.6535, pineal gland and brain), and THC137267
(SGC35226, pineal gland). The gene for one of these
clusters, THC153932, maps within the candidate region
for dominant optic atrophy (OPA1); however, because
GeneMap does not include evidence of retinal
expression for this gene, it might not be considered a
candidate for a retinal disease based on GeneMap
alone. The four remaining genes with mapping confirmed
by GeneMap'98 include retina or retina and
brain ESTs, but are not located within candidate regions;
they are THC207703 (Unigene Hs.12513),
THC157357 (WI-20494), THC137161 (Unigene
Hs.64616), and THC195887 (stSG40815).

In conclusion, we report the identification and local- 
ization of 26 novel retina/pineal gland-expressed genes 
by a combination of database analysis and laboratory 
techniques. The expression pattern of these genes sug- 
gests the possibility of expression in photoreceptors, 
which is the expression pattern of several genes known 



to cause inherited retinal disorders. Further, 7 of these
genes are candidates for the cause of known inherited
retinal diseases. The combined approach of database
analysis and laboratory investigation, incorporating
recognition of the biological relationship between photoreceptors
and pinealocytes, is an effective technique
for identification of candidate genes for inherited retinal
disorders.

ACKNOWLEDGMENTS 

We thank Odessa L. June, Human Genetics Center, The Univer- 
sity of Texas Health Science Center, Houston for expert technical 
assistance. This work was supported by grants from the Foundation 
Fighting Blindness and the George Gund Foundation, by the William 
Stamps Farish Fund and the M.D. Anderson Foundation, by an NIH
grant [number illegible], and by NIH-NEI National Institutional Service
Award EY07024.

REFERENCES 

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J.
(1990). Basic local alignment search tool. J. Mol. Biol. 215: 403-410.

Blackshaw, S., and Snyder, S. H. (1997). Developmental expression
pattern of phototransduction components in mammalian pineal
implies a light-sensing function. J. Neurosci. 17: 8074-8082.

Casarosa, S., Andreazzoli, M., Simeone, A., and Barsacchi, G. (1997).
Xrx1, a novel Xenopus homeobox gene expressed during eye and
pineal gland development. Mech. Dev. 61: 187-198.

Chen, S., Wang, Q., Nie, Z., Sun, H., Lennon, G., Copeland, N. G.,
Gilbert, D. J., Jenkins, N. A., and Zack, D. J. (1997). Crx, a novel
Otx-like paired-homeodomain protein, binds to and transactivates
photoreceptor cell-specific genes. Neuron 19: 1017-1030.

Freund, C. L., Gregory-Evans, C. Y., Furukawa, T., Papaioannou,
M., Looser, J., Ploder, L., Bellingham, J., and McInnes, R. R.
(1997). Cone-rod dystrophy due to mutations in a novel
photoreceptor-specific homeobox gene (CRX) essential for maintenance of
the photoreceptor. Cell 91: 543-553.

Freund, C. L., Wang, Q. L., Chen, S., Muskat, B. L., Sheffield, V. C.,
Jacobson, S. G., McInnes, R. R., et al. (1998). De novo mutations in
the CRX homeobox gene associated with Leber congenital amaurosis.
Nat. Genet. 18: 1-2.

Fuchs, S., Nakazawa, M., Maw, M., Tamai, M., Oguchi, Y., and Gal,
A. (1995). A homozygous 1-base pair deletion in the arrestin gene
is a frequent cause of Oguchi disease in Japanese. Nat. Genet. 10:
360-362.

Gilbert, S. F. (1994). "Developmental Biology," 4th ed., Sinauer
Associates, Sunderland, MA.

Khani, S. C., Nielsen, L., and Vogt, T. M. (1998). Biochemical evidence
for pathogenicity of rhodopsin kinase mutations correlated
with the Oguchi form of congenital stationary night blindness.
Proc. Natl. Acad. Sci. USA 95: 2824-2827.

Lolley, R. N., Craft, C. M., and Lee, R. H. (1992). Photoreceptors of
the retina and pinealocytes share common components of signal
transduction. Neurochem. Res. 17: 81-89.

Mitchell, S. J., McHale, D. P., Campbell, D. A., Lench, N. J., Mueller,
R. F., Bundey, S. E., and Markham, A. F. (1998). A syndrome of
severe mental retardation, spasticity, and tapetoretinal degeneration
linked to chromosome 15q24. Am. J. Hum. Genet. 62: 1070-1076.

Nakazawa, M., Wada, Y., and Tamai, M. (1998). Arrestin gene mutations
in autosomal recessive retinitis pigmentosa. Arch. Ophth.
116: 498-501.

Nathans, J., and Hogness, D. S. (1984). Isolation and nucleotide
sequence of the gene encoding human rhodopsin. Science 232:
203-210.

Sohocki, M. M., Sullivan, L. S., Mintz-Hittner, H. A., Birch, D.,
Heckenlively, J. R., Freund, C. L., McInnes, R. R., and Daiger, S. P.
(1998). A range of clinical phenotypes associated with mutations in
CRX, a photoreceptor transcription factor gene. Am. J. Hum.
Genet. 63: 1307-1315.

Swain, P. K., Chen, S., Wang, Q., Affatigato, L. M., Coats, C. L.,
Brady, K. D., Fishman, G. A., Jacobson, S. G., Swaroop, A., Stone,
E., Sieving, P. A., and Zack, D. J. (1997). Mutations in the cone-rod
homeobox gene are associated with the cone-rod dystrophy
photoreceptor degeneration. Neuron 19: 1329-1336.

Vollrath, L. (1985). Mammalian pinealocytes: Ultrastructural aspects
and innervation. In "Photoperiodism, Melatonin and the
Pineal," pp. 9-17. Pitman, Avon, UK.

Wada, Y., Nakazawa, M., Fuchs, S., Gal, A., and Tamai, M. (1996).
Phenotypic characteristics of patients with Oguchi's disease associated
with frequent 1147delA mutation in the arrestin gene.
Invest. Ophth. Vis. Sci. 37: 995.

Yamamoto, S., Sippel, K. C., Berson, E. L., and Dryja, T. P. (1997).
Defects in the rhodopsin kinase gene in the Oguchi form of stationary
night blindness. Nat. Genet. 15: 175-178.



ScienceDirect - Genomics, Volume 58, Issue 1, Pages 1-111 (15 May 1999)

Genomics

Copyright © 2004 Elsevier Inc. All rights reserved

Volume 58, Issue 1, Pages 1-111 (15 May 1999)

1. A Genome-wide Search for Linkage to Asthma • ARTICLE
Pages 1-8
Matthias Wjst, Guido Fischer, Thomas Immervoll, Martin Jung, Kathrin Saar, Franz
Rueschendorf, Andre Reis, Matthias Ulbrecht, Maria Gomolka, Elisabeth H. Weiss et al.

3. Acquisition of the H19 Methylation Imprint Occurs Differentially on the Parental
Alleles during Spermatogenesis • ARTICLE
Pages 18-28
Tamara L. Davis, Jacquetta M. Trasler, Stuart B. Moss, Grace J. Yang and Marisa S.
Bartolomei

5. Linkage Analysis Narrows the Critical Region for Oculodentodigital Dysplasia to
Chromosome 6q22-q23 • ARTICLE
Pages 34-40
Simeon A. Boyadjiev, Ethylin Wang Jabs, Michele LaBuda, Joseph E. Jamal, Torberg
Torbergsen, Louis J. Ptáček, R. Curtis Rogers, Rolf Nyberg-Hansen, Stein Opjordsmoen
et al.

7. Guinea Pig p53 mRNA: Identification of New Elements in Coding and Untranslated
Regions and Their Functional and Evolutionary Implications • ARTICLE
Pages 50-64
A. M. D'Erchia, G. Pesole, A. Tullo, C. Saccone and E. Sbisà

9. Cloning of a Novel Member of the Reticulon Gene Family (RTN3): Gene Structure
and Chromosomal Localization to 11q13 • ARTICLE
Pages 73-81
E. F. Moreira, C. J. Jaworski and I. R. Rodriguez

11. Genomic Organization and Chromosomal Location of the Mouse Vasoactive
Intestinal Polypeptide 1 (VPAC1) Receptor • SHORT COMMUNICATION
Pages 90-93
Hitoshi Hashimoto, Akiko Nishino, Norihito Shintani, Nami Hagihara, Neal G. Copeland,
Nancy A. Jenkins, Kyohei Yamamoto, Toshio Matsuda, Takeshi Ishihara, Shigekazu
Nagata and Akemichi Baba

13. Bestrophin Gene Mutations in Patients with Best Vitelliform Macular Dystrophy •
SHORT COMMUNICATION
Pages 98-101
Germaine M. Caldwell, Laura E. Kakuk, Irina B. Griesinger, Stacey A. Simpson, Norma J.
Nowak, Kent W. Small, Irene H. Maumenee, Philip J. Rosenfeld, Paul A. Sieving, Thomas
B. Shows and Radha Ayyagari

15. The BTRC Gene, Encoding a Human F-Box/WD40-Repeat Protein, Maps to
Chromosome 10q24-q25 • SHORT COMMUNICATION
Pages 104-105
Tsutomu Fujiwara, Mikio Suzuki, Akira Tanigami, Tsuneo Ikenoue, Masao Omata,
Tomoki Chiba and Keiji Tanaka

17. Analysis of Distribution in the Human, Pig, and Rat Genomes Points toward a
General Subtelomeric Origin of Minisatellite Structures: Volume 52, Number 1
(1998), pages 62-71 • ERRATUM
Pages 109-110
Valerie Amarger, Dominique Gauguier, Martine Yerle, Françoise Apiou, Philippe Pinton,
Fabienne Giraudeau, Sylvaine Monfouilloux, Mark Lathrop, Bernard Dutrillaux, Jerome
Buard and Gilles Vergnaud

http://www.sciencedirect.com/science?_ob=IssueURL&_tockey=%23TOC%236809%23199... 7/20/04





UNITED STATES PATENT AND TRADEMARK OFFICE

UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office
Address: COMMISSIONER FOR PATENTS
P.O. Box 1450
Alexandria, Virginia 22313-1450
www.uspto.gov



APPLICATION NO.: 09/933,528
FILING DATE: 08/20/2001
FIRST NAMED INVENTOR: Christophe Person
ATTORNEY DOCKET NO.: LXGN-00104
CONFIRMATION NO.: 5324



7590 07/23/2004 

C. Steven McDaniel, Esq. 
McDaniel & Associates, P.C. 
P.O. Box 2244 
Austin, TX 78768-2244 



EXAMINER: BRUSCA, JOHN S

ART UNIT: 1631

PAPER NUMBER:

DATE MAILED: 07/23/2004



Please find below and/or attached an Office communication concerning this application or proceeding. 



PTO-90C (Rev. 10/03) 




Office Action Summary

Application No.: 09/933,528
Applicant(s): PERSON, CHRISTOPHE
Examiner: John S. Brusca
Art Unit: 1631



- The MAILING DATE of this communication appears on the cover sheet with the correspondence address - 
Period for Reply 

A SHORTENED STATUTORY PERIOD FOR REPLY IS SET TO EXPIRE 3 MONTH(S) FROM 
THE MAILING DATE OF THIS COMMUNICATION. 

- Extensions of time may be available under the provisions of 37 CFR 1.136(a). In no event, however, may a reply be timely filed
after SIX (6) MONTHS from the mailing date of this communication.

- If the period for reply specified above is less than thirty (30) days, a reply within the statutory minimum of thirty (30) days will be considered timely.

- If NO period for reply is specified above, the maximum statutory period will apply and will expire SIX (6) MONTHS from the mailing date of this communication.

- Failure to reply within the set or extended period for reply will, by statute, cause the application to become ABANDONED (35 U.S.C. § 133).
Any reply received by the Office later than three months after the mailing date of this communication, even if timely filed, may reduce any
earned patent term adjustment. See 37 CFR 1.704(b).



3) [ ] Since this application is in condition for allowance except for formal matters, prosecution as to the merits is
closed in accordance with the practice under Ex parte Quayle, 1935 C.D. 11, 453 O.G. 213.

Disposition of Claims

4) [X] Claim(s) 2, 3, 5-33 and 39 is/are pending in the application.

4a) Of the above claim(s) 39 is/are withdrawn from consideration.

5) [ ] Claim(s) is/are allowed.

6) [X] Claim(s) 2, 3 and 5-33 is/are rejected.
7) [ ] Claim(s) is/are objected to.

8) [ ] Claim(s) are subject to restriction and/or election requirement.

Application Papers 

9) [X] The specification is objected to by the Examiner.

10) [X] The drawing(s) filed on 20 August 2001 is/are: a) [X] accepted or b) [ ] objected to by the Examiner.

Applicant may not request that any objection to the drawing(s) be held in abeyance. See 37 CFR 1.85(a).
Replacement drawing sheet(s) including the correction is required if the drawing(s) is objected to. See 37 CFR 1.121(d).

11) 0 The oath or declaration is objected to by the Examiner. Note the attached Office Action or form PTO-152.

Priority under 35 U.S.C. § 119

12) 0 Acknowledgment is made of a claim for foreign priority under 35 U.S.C. § 119(a)-(d) or (f).
a)n All b)n Some * c)^ None of: 

1 Certified copies of the priority documents have been received. 

2. n Certified copies of the priority documents have been received in Application No, . 

3. D Copies of the certified copies of the priority documents have been received in this National Stage 

application from the International Bureau (PCT Rule 17.2(a)). 
* See the attached detailed Office action for a list of the certified copies not received. 



Status 



1) [X] Responsive to communication(s) filed on 16 June 2004.
2a) [ ] This action is FINAL. 2b) [X] This action is non-final.



Attachment(s) 

1) [X] Notice of References Cited (PTO-892)

2) [ ] Notice of Draftsperson's Patent Drawing Review (PTO-948)

3) [X] Information Disclosure Statement(s) (PTO-1449 or PTO/SB/08)
Paper No(s)/Mail Date 4/8/02.

4) [ ] Interview Summary (PTO-413)
Paper No(s)/Mail Date.

5) [ ] Notice of Informal Patent Application (PTO-152)

6) [ ] Other: .



U.S. Patent and Trademark Office 
PTOL-326 (Rev. 1-04) 



Office Action Summary 



Part of Paper No./Mail Date 20040720 




Application/Control Number: 09/933,528 Page 2 

Art Unit: 1631

DETAILED ACTION 
Election/Restrictions 

1. Applicant's election of Group 2 in the reply filed on 16 June 2004 is acknowledged. 
Because applicant did not distinctly and specifically point out the supposed errors in the 
restriction requirement, the election has been treated as an election without traverse (MPEP 
§ 818.03(a)). 

2. Claim 39 is withdrawn from further consideration pursuant to 37 CFR 1.142(b) as being
drawn to a nonelected invention, there being no allowable generic or linking claim. Election was 
made without traverse in the reply filed on 16 June 2004. In the restriction requirement mailed 
claim 39 was omitted from nonelected Group 5, drawn to databases. Claim 39 is withdrawn in 
view of the election of Group 2. 

3. It is noted that the response filed 16 June 2004 contains a marked up copy of the claims 
as required by 37 CFR 1.121 and in addition contains an unnecessary unmarked copy of the 
claims that will not be considered to be the official copy of the claims.

Priority

4. Applicant has not complied with one or more conditions for receiving the benefit of an
earlier filing date under 35 U.S.C. 119(e) as follows:

An application in which the benefits of an earlier application are desired must contain a
specific reference to the prior application(s) in the first sentence of the specification or in an
application data sheet (37 CFR 1.78(a)(2) and (a)(5)). The specific reference to any prior
nonprovisional application must include the relationship (i.e., continuation, divisional, or
continuation-in-part) between the applications except when the reference is to a prior application
of a CPA assigned the same application number.

It is apparent from the rule 63 Declaration filed on 11 December 2001 that the applicants
intended to claim the benefit of U.S. Provisional Application No. 60/227099. However, until the
specification is amended to refer to the above application, no claim for benefit will be recognized.

Specification 

5. The sequence listing and computer readable form filed 17 March 2003 have been entered
into the application history.

6. This application contains sequence disclosures that are encompassed by the definitions
for nucleotide and/or amino acid sequences set forth in 37 CFR §§ 1.821(a)(1) and (a)(2).
However, this application fails to comply with the requirements of 37 CFR §§ 1.821-1.825 for
the following reasons:

Several nucleotide sequences appear in the specification in figure 3 that are not properly
identified. Nucleotide sequences must be identified by sequence identification number.
Furthermore, if said sequences do not appear in the sequence listing, a new listing including said
sequences must be supplied. It is often convenient to identify sequences in figures by amending
the Brief Description of the Drawings section (see MPEP 2422.02). If said sequences consist of a
portion of sequences already of record in the sequence listing, they may be identified in the
specification using the existing SEQ ID No. accompanied by the position of the sequence on the
already listed sequence.

Applicants are required to comply with all the requirements of 37 CFR §§ 1.821-1.825.
Any response to this Office Action which fails to meet all of these requirements will be
considered non-responsive. The nature of the sequences disclosed in the instant application has
allowed an examination on the merits, the results of which are communicated below.

7. The specification is objected to as failing to provide proper antecedent basis for the 
claimed subject matter. See 37 CFR 1.75(d)(1) and MPEP § 608.01(o). Correction of the
following is required: The subject matter of claims 10-15 and 17 does not have antecedent basis in
the specification. 

Claim Rejections - 35 USC § 112 

8. The following is a quotation of the first paragraph of 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making
and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it
pertains, or with which it is most nearly connected, to make and use the same and shall set forth the best mode
contemplated by the inventor of carrying out his invention.

9. Claim 17 is rejected under 35 U.S.C. 112, first paragraph, as failing to comply with the
written description requirement. The claim(s) contains subject matter which was not described
in the specification in such a way as to reasonably convey to one skilled in the relevant art that
the inventor(s), at the time the application was filed, had possession of the claimed invention.

Claim 17 is drawn to methods that use a database encoded in a biological medium. The
specification does not describe databases encoded in a biological medium.

10. The following is a quotation of the second paragraph of 35 U.S.C. 112:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the
subject matter which the applicant regards as his invention.

11. Claims 5-16 are rejected under 35 U.S.C. 112, second paragraph, as being indefinite for
failing to particularly point out and distinctly claim the subject matter which applicant regards as
the invention.

Claims 5-16 are indefinite for recitation of the phrase "said sequences" because it is not
clear which of the sequences in the claims from which claims 5-16 depend the phrase refers to.

Claim Rejections - 35 USC § 103

12. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all
obviousness rejections set forth in this Office action:

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are
such that the subject matter as a whole would have been obvious at the time the invention was made to a person
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the
manner in which the invention was made.

13. The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459
(1966), that are applied for establishing a background for determining obviousness under 35
U.S.C. 103(a) are summarized as follows:

1. Determining the scope and contents of the prior art.

2. Ascertaining the differences between the prior art and the claims at issue.

3. Resolving the level of ordinary skill in the pertinent art.

4. Considering objective evidence present in the application indicating obviousness
or nonobviousness.

14. Claims 2, 3, 5, 7, 8, 18-20, 27, and 30 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Jurka et al. (1996). 

The claims are drawn to a method of making a repeat sequence database by masking 
repeat sequences in a query sequence wherein the repeat sequences are in a repeat sequence 
database, and determining if any remaining unmatched sequences in the query sequence are 
repeat sequences in a repeat sequence database, and if such repeat sequences are determined in 
the query sequence, the query repeat sequences so determined are added to a repeat sequence 
database. In some embodiments the right and left endpoints of the match are determined, the
sequences are DNA sequences, the sequences are human sequences, the repeat sequence
databases are internet accessible and on computer-readable media, and the matching of
sequences is performed by a database search algorithm. In some embodiments the search
algorithm is a Smith Waterman algorithm.
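For orientation only, the mask-then-extend workflow summarized above can be sketched in Python. This is not the applicant's claimed implementation and not the CENSOR program: exact substring matching stands in for a Smith Waterman search, and the function names and the caller-supplied `is_repeat` test are hypothetical.

```python
def mask_known_repeats(query, repeat_db, mask_char="N"):
    """Mask every exact occurrence of a known repeat.

    Returns the masked query and a sorted list of (left, right, repeat)
    hits, where left and right are the endpoints of each match.
    """
    masked = list(query)
    hits = []
    for repeat in repeat_db:
        start = query.find(repeat)
        while start != -1:
            end = start + len(repeat)
            masked[start:end] = mask_char * len(repeat)
            hits.append((start, end, repeat))
            start = query.find(repeat, start + 1)
    return "".join(masked), sorted(hits)


def unmasked_segments(masked, mask_char="N", min_len=4):
    """Return the unmasked stretches of at least min_len characters."""
    return [seg for seg in masked.split(mask_char) if len(seg) >= min_len]


def update_repeat_db(query, repeat_db, is_repeat):
    """Mask known repeats, test what remains, and add newly found repeats.

    is_repeat is a caller-supplied predicate (an assumption of this sketch)
    that decides whether an unmasked segment should count as a repeat.
    """
    masked, _ = mask_known_repeats(query, repeat_db)
    found = [seg for seg in unmasked_segments(masked) if is_repeat(seg)]
    return repeat_db + [seg for seg in found if seg not in repeat_db]
```

The function returns a new list rather than modifying `repeat_db` in place, so the database before and after the update can be compared.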

Jurka et al. (1996) shows in the program description on pages 119-121 a database
matching program called CENSOR. CENSOR determines whether a query sequence contains
repeats that match sequences in a repeat sequence database. CENSOR censors those repeat
sequences so that the remaining query sequence may be matched against the database of choice
without giving undesirable matches to repeat sequences that have been censored. Jurka et al.
(1996) shows on page 119 that in the art the terms censor and masking are equivalent. Jurka et al.
shows matching of query sequences that are DNA and determination of the right and left
endpoints of the match and masked regions in figure 1. Jurka et al. (1996) shows human
repetitive databases in the introduction on page 119. Jurka et al. (1996) shows computer-based
repeat sequence databases throughout, and use of LOCAL, a Smith Waterman database search
algorithm throughout. Jurka et al. shows on page 121 that one use of CENSOR is to allow for
masking of repeated sequence followed by a second matching to a repeat sequence database
using different parameters for possible identification and censoring of more distant repeats. Jurka
et al. (1996) does not show addition of repeats identified by comparison of a masked query
sequence to a repeat sequence database.

It would have been obvious to a person of ordinary skill in the art at the time the
invention was made to modify the method of Jurka et al. (1996) by addition of newly determined
repeat sequences to a repeat sequence database so that the repeat sequence database would be a
more comprehensive listing of repeat sequences.




15. Claims 2, 6, 15, 16, 19-24, 26-29, and 31-33 are rejected under 35 U.S.C. 103(a) as being
unpatentable over Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, 
and further in view of Altschul et al. 

The claims are drawn to the method of claim 2 further limited to analysis of 
ribonucleotide sequences, sequences that encode amino acid sequences, synthetic DNA such as 
cDNA, repeat sequence databases accessible through the internet, use of public domain databases 
GenBank, dbEST, and SwissProt, use of search algorithms BLAST and FASTA, and use of
scoring matrices PAM and BLOSUM. 

Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above does not
show the method of claim 2 further limited to analysis of ribonucleotide sequences, sequences
that encode amino acid sequences, repeat sequence databases accessible through the internet, use
of public domain databases GenBank, dbEST, and SwissProt, use of search algorithms BLAST
and FASTA, and use of scoring matrices PAM and BLOSUM.

Altschul et al. reviews searching sequence databases. Altschul et al. shows searching 
query sequences derived from mRNA such as cDNA that encode proteins on page 119 and 
figures 2 and 3. Altschul et al. shows repeat sequence databases accessible through the internet
used to mask query sequences on page 128. Altschul et al. shows public domain databases
GenBank on page 124, SwissProt on page 127, and dbEST on page 128 (reference 60). Altschul 
et al. shows use of BLAST and FASTA search algorithms on page 120 and use of scoring 
matrices PAM and BLOSUM on pages 123-124. 

It would have been obvious to a person of ordinary skill in the art at the time the
invention was made to modify the method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8,
18-20, 27, and 30 above by use of analysis of ribonucleotide sequences, sequences that encode
amino acid sequences, repeat sequence databases accessible through the internet, use of public
domain databases GenBank, dbEST, and SwissProt, use of search algorithms BLAST and
FASTA, and use of scoring matrices PAM and BLOSUM because Altschul et al. shows use of
all of those features in the context of searching sequence databases with query sequences whose
repeat sequences have been masked.

16. Claims 2 and 7-14 are rejected under 35 U.S.C. 103(a) as being unpatentable over Jurka
et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, and further in view of
Jurka (1998).

The claims are drawn to the method of claim 2 utilizing sequences from mice, plants, 
fungi, and microorganisms.

Jurka (1998) reviews repeat sequences from a variety of organisms. Jurka (1998) points 
to mouse repeat sequences on page 334 and table 1.

It would have been obvious to a person of ordinary skill in the art at the time the 
invention was made to modify the method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8,
18-20, 27, and 30 above by use of repeat sequences from a variety of organisms so that 
corresponding query sequences from the organisms could be analyzed and masked. 

17. Claims 2, 22, and 25 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above, and further in view
of Sohocki et al.

The claims are drawn to the method of claim 2 further limited to use of a TIGR database.



Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 18-20, 27, and 30 above does not
show use of a TIGR database.

Sohocki et al. shows in the abstract and throughout use of the TIGR Human Gene Index 
database to search for genes for inherited retinal disorders.

It would have been obvious to a person of ordinary skill in the art at the time the 
invention was made to modify the method of Jurka et al. (1996) as applied to claims 2, 3, 5, 7, 8, 
18-20, 27, and 30 above by use of the TIGR Human Gene Index database because Sohocki et al.
shows that the database is a useful source of human genes such as genes related to inherited 
retinal disorders. 

Conclusion 

18. Any inquiry of a general nature or relating to the status of this application or proceeding 
should be directed to (571) 272-0547. 

19. Patent applicants with problems or questions regarding electronic images that can be
viewed in the Patent Application Information Retrieval system (PAIR) can now contact the
USPTO's Patent Electronic Business Center (Patent EBC) for assistance. Representatives are
available to answer your questions daily from 6 am to midnight (EST). The toll free number is
(866) 217-9197. When calling please have your application serial or patent number, the type of
document you are having an image problem with, the number of pages and the specific nature of
the problem. The Patent Electronic Business Center will notify applicants of the resolution of
the problem within 5-7 business days. Applicants can also check PAIR to confirm that the
problem has been corrected. The USPTO's Patent Electronic Business Center is a complete
service center supporting all patent business on the Internet. The USPTO's PAIR system
provides Internet-based access to patent application status and history information. It also
enables applicants to view the scanned images of their own application file folder(s) as well as
general patent information available to the public.

For all other customer support, please call the USPTO Call Center (UCC) at 800-786-9199.

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to John S. Brusca whose telephone number is (571) 272-0714. The
examiner can normally be reached on M-F 8:30-5:00. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Michael Woodward can be reached on (571) 272-0722. The fax phone number for 
the organization where this application or proceeding is assigned is 703-872-9306.

Information regarding the status of an application may be obtained from the Patent
Application Information Retrieval (PAIR) system. Status information for published applications
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished
applications is available through Private PAIR only. For more information about the PAIR
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 




John S. Brusca 
Primary Examiner 
Art Unit 1631 



jsb 



Notice of References Cited


Application/Control No.: 09/933,528
Applicant(s)/Patent Under Reexamination: PERSON, CHRISTOPHE
Examiner: John S. Brusca
Art Unit: 1631
Page 1 of 1



U.S. PATENT DOCUMENTS

Columns: Document Number (Country Code-Number-Kind Code); Date (MM-YYYY); Name; Classification

Entries A through M: blank.

FOREIGN PATENT DOCUMENTS

Columns: Document Number (Country Code-Number-Kind Code); Date (MM-YYYY); Country; Name; Classification

Entries N through T: blank.

NON-PATENT DOCUMENTS

(Include as applicable: Author, Title, Date, Publisher, Edition or Volume, Pertinent Pages)

U  Jurka et al. CENSOR - a program for identification and elimination of repetitive elements from DNA sequences. Computers and Chemistry Vol. 20, pages 119-121 (1996)

V  Altschul et al. Issues in searching molecular sequence databases. Nature Genetics Vol. 6 pages 119-129 (1994)

W  Jurka. Repeats in genomic DNA: mining and meaning. Current Opinion in Structural Biology Vol. 8 pages 333-337 (1998)

X  Sohocki et al. Localization of retina/pineal-expressed sequences: Identification of novel candidate genes for inherited retinal disorders. Genomics Vol. 58 pages 29-33 (1999)

*A copy of this reference is not being furnished with this Office action. (See MPEP § 707.05(a).)
Dates in MM-YYYY format are publication dates. Classifications may be US or foreign.



U.S. Patent and Trademark Office
PTO-892 (Rev. 01-2001)
Notice of References Cited
Part of Paper No. 20040720



Please type a plus sign (+) inside this box □

PTO/SB/08A (10-96)
Approved for use through 10/31/99. OMB 0651-0031
Patent and Trademark Office; U.S. DEPARTMENT OF COMMERCE

Substitute for form 1449A/PTO

INFORMATION DISCLOSURE
STATEMENT BY APPLICANT

(use as many sheets as necessary)

Complete If Known

Application Number:
Filing Date: August 20, 2001
First Named Inventor: Person, Christophe
Group Art Unit: [illegible] Assigned
Examiner Name:
Attorney Docket Number: LXGN-00104

Sheet 1 of 1



U.S. PATENT DOCUMENTS

Columns: Examiner Initials*; Cite No.; U.S. Patent Document Number (if known); Name of Patentee or Applicant of Cited Document; Date of Publication of Cited Document (MM-DD-YYYY); Pages, Columns, Lines, Where Relevant Passages or Relevant Figures Appear

No entries.

FOREIGN PATENT DOCUMENTS

Columns: Examiner Initials*; Cite No.; Foreign Patent Document (Office, Number (if known)); Name of Patentee or Applicant of Cited Document; Date of Publication of Cited Document (MM-DD-YYYY); Pages, Columns, Lines, Where Relevant Passages or Relevant Figures Appear

No entries.























OTHER PRIOR ART - NON PATENT LITERATURE DOCUMENTS

Columns: Examiner Initials*; Cite No.; Include name of the author (in CAPITAL LETTERS), title of the article (when appropriate), title of the item (book, magazine, journal, serial, symposium, catalog, etc.), date, page(s), volume-issue number(s), publisher, city and/or country where published.

AA  ALTSCHUL, STEPHEN F. ET AL., 1990, "Basic Local Alignment Search Tool", J. Mol. Biol. 215:403-410.

AB  ALTSCHUL, STEPHEN F. ET AL., 1997, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

AC  FENG, F. ET AL., 1984-85, "Aligning amino acid sequences: comparison of commonly used methods", J Mol Evol. 21(2):112-25.

AD  HENIKOFF, S. and HENIKOFF, J.G., 1992, "Amino acid substitution matrices from protein blocks", Proc Natl Acad Sci USA 89(22):10915-9.

AE  KARLIN, S. and GHANDOUR, G., 1985, "Multiple-alphabet amino acid sequence comparisons of the immunoglobulin kappa-chain constant domain", Proc Natl Acad Sci USA 82(24):8597-601.

AF  LIPMAN, DAVID J. and PEARSON, W.R., 1985, "Rapid and sensitive similarity searches", Science 227:1435-1441.

AG  PEARSON, W. and LIPMAN, DAVID, 1998, "Improved tools for biological sequence comparison", Proc. Natl. Acad. Sci. 85:2444-2448.

AH  PEARSON, W., 1990, "Rapid and sensitive sequence comparison with FASTP and FASTA", Methods in Enzymology 183:63-98.

AI  SMITH, T.F. and WATERMAN, M.S., 1981, "Identification of common molecular subsequences", J. Mol. Biol. 147:195-197.





























Examiner Signature:
Date Considered:

*EXAMINER: Initial if reference considered, whether or not citation is in conformance with MPEP 609. Draw line through citation if not in conformance
and not considered. Include copy of this form with next communication to applicant.

1 Unique citation designation number. 2 See attached Kinds of U.S. Patent Documents. 3 Enter Office that issued the document, by the two-letter
code (WIPO Standard ST.3). 4 For Japanese patent documents, the indication of the year of the reign of the Emperor must precede the serial
number of the patent document. 5 Kind of document by the appropriate symbols as indicated on the document under WIPO Standard ST.16 if
possible. 6 Applicant is to place a check mark here if English language Translation is attached.

Burden Hour Statement: This form is estimated to take 2.0 hours to complete. Time will vary depending upon the needs of the individual case. Any
comments on the amount of time you are required to complete this form should be sent to the Chief Information Officer, Patent and Trademark Office,
Washington, DC 20231. DO NOT SEND FEES OR COMPLETED FORMS TO THIS ADDRESS. SEND TO: Assistant Commissioner for Patents,
Washington, DC 20231.



